I have a 950k prompt, how do I not reprocess it every time? #18244
I use curl requests to save and restore the KV cache from/to server slots. Here is a simple example, split into two bash scripts for easier understanding. In this simple example I am not checking which slot I want saved, and assume only one slot exists (slot 0). You need to make sure only one slot exists on your server (with a unified KV cache or some other method), or adapt the following scripts to check each slot and save the one that contains data.

Save KV cache:

```bash
#!/bin/bash
###########################################
########### EDIT THESE VARIABLES ##########
###########################################
SERVER_URL="http://localhost:10001"  # base URL/IP of the server
CACHE_ROOT="$HOME/.cache/llamacpp"   # example path where the cache *.bin files live (set with --slot-save-path)
CACHE_NAME="mycache.bin"             # name of the file that will store the cache
SLOT_NUMBER=0                        # generally, if only using one slot, it is slot 0

###########################################
########## SIMPLE SAVE OPERATION ##########
###########################################
slot_url="${SERVER_URL}/slots/${SLOT_NUMBER}"
payload=$(printf '{"filename":"%s"}' "$CACHE_NAME")  # JSON payload – safe quoting via printf

# Perform the request
curl -X POST "${slot_url}?action=save" \
     -H 'Content-Type: application/json' \
     -d "$payload"
```

Restore KV cache:

```bash
#!/bin/bash
###########################################
########### EDIT THESE VARIABLES ##########
###########################################
SERVER_URL="http://localhost:10001"  # base URL/IP of the server
CACHE_ROOT="$HOME/.cache/llamacpp"   # example path where the cache *.bin files live (set with --slot-save-path)
CACHE_NAME="mycache.bin"             # name of the file that holds the cache
SLOT_NUMBER=0                        # generally, if only using one slot, it is slot 0

###########################################
######## SIMPLE RESTORE OPERATION #########
###########################################
slot_url="${SERVER_URL}/slots/${SLOT_NUMBER}"
payload=$(printf '{"filename":"%s"}' "$CACHE_NAME")  # JSON payload – safe quoting via printf

# Perform the request
curl -X POST "${slot_url}?action=restore" \
     -H 'Content-Type: application/json' \
     -d "$payload"
```

And remember to start the server with `--slot-save-path` set to the folder holding the cached files (in the example above, it would be `$HOME/.cache/llamacpp`). With this basic info you can build your own slot saving/restoring script, in Python or any other language, to match your needs.
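As a starting point for such a script, the two snippets above can be folded into a single bash function where the action (`save` or `restore`), filename, slot number, and server URL are parameters. This is only a sketch assuming the same endpoints as the scripts above; the default slot and URL are illustrative:

```shell
#!/bin/bash
# kv_slot_request: save or restore a server slot's KV cache via the /slots endpoint.
# Usage: kv_slot_request <save|restore> <filename> [slot] [server_url]
kv_slot_request() {
  local action="$1" filename="$2" slot="${3:-0}" server="${4:-http://localhost:10001}"

  # Reject anything other than the two supported actions before touching the network
  case "$action" in
    save|restore) ;;
    *) echo "action must be 'save' or 'restore'" >&2; return 1 ;;
  esac

  # JSON payload – safe quoting via printf, same as in the scripts above
  local payload
  payload=$(printf '{"filename":"%s"}' "$filename")

  curl -s -X POST "${server}/slots/${slot}?action=${action}" \
       -H 'Content-Type: application/json' \
       -d "$payload"
}
```

Example usage: `kv_slot_request save mycache.bin` after the long prompt has been processed, then `kv_slot_request restore mycache.bin` in a later session.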
I wanted to see what Nemotron-3-Nano-30B-A3B (1M context) could make of my entire codebase.
With a lot of trimming, I managed to fit everything into 950k tokens. But it would take several hours to process in llama.cpp.
I'm looking into llama.cpp's prompt caching features, so I could pay this huge processing cost only once and then reuse the result in future runs to ask any questions I want.
I see that llama-cli supports this feature perfectly: I could use `--prompt-cache entirecodebase.bin` the very first time, wait for it to process the 950k tokens, then close it when it's done. On subsequent runs I could do `--prompt-cache entirecodebase.bin --prompt-cache-ro`. However, llama-server does not seem to support anything like this. I would really prefer to use the web interface, especially since I want to let a novice I'm mentoring also ask it questions.
Do I have any other options here or am I stuck with llama-cli?
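For reference, the two llama-cli runs I have in mind would look roughly like this (the model path, context size, and prompt file are placeholders, not actual values from my setup):

```shell
# First run: process the 950k-token prompt once and persist the state to disk
llama-cli -m model.gguf -c 1000000 -f prompt.txt --prompt-cache entirecodebase.bin

# Later runs: reload the saved state read-only instead of reprocessing
llama-cli -m model.gguf -c 1000000 -f prompt.txt --prompt-cache entirecodebase.bin --prompt-cache-ro
```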