I have a 950k prompt, how do I not reprocess it every time? #18244
I use curl requests to save and restore the KV cache from/to server slots. Here is a simple example, split into two bash scripts for easier understanding. In this simple example I am not checking which slot I want saved, and assume only one slot exists (slot 0). You need to make sure only one slot exists on your server (with a unified KV cache or some other method), or adapt the following scripts to check each slot and save the one that contains data.

Save KV cache:

```bash
#!/bin/bash
###########################################
########### EDIT THESE VARIABLES ##########
###########################################
SERVER_URL="http://localhost:10001"  # base URL/IP of the server
CACHE_ROOT="$HOME/.cache/llamacpp"   # example path where the cache *.bin files live (set with --slot-save-path)
CACHE_NAME="mycache.bin"             # name of the file that will store the cache
SLOT_NUMBER=0                        # generally, if only using one slot, it is slot 0

###########################################
########## SIMPLE SAVE OPERATION ##########
###########################################
slot_url="${SERVER_URL}/slots/${SLOT_NUMBER}"
payload=$(printf '{"filename":"%s"}' "$CACHE_NAME")  # JSON payload – safe quoting via printf

# Perform the request
curl -X POST "${slot_url}?action=save" \
     -H 'Content-Type: application/json' \
     -d "$payload"
```

Restore KV cache:

```bash
#!/bin/bash
###########################################
########### EDIT THESE VARIABLES ##########
###########################################
SERVER_URL="http://localhost:10001"  # base URL/IP of the server
CACHE_ROOT="$HOME/.cache/llamacpp"   # example path where the cache *.bin files live (set with --slot-save-path)
CACHE_NAME="mycache.bin"             # name of the file that holds the cache
SLOT_NUMBER=0                        # generally, if only using one slot, it is slot 0

###########################################
######## SIMPLE RESTORE OPERATION #########
###########################################
slot_url="${SERVER_URL}/slots/${SLOT_NUMBER}"
payload=$(printf '{"filename":"%s"}' "$CACHE_NAME")  # JSON payload – safe quoting via printf

# Perform the request
curl -X POST "${slot_url}?action=restore" \
     -H 'Content-Type: application/json' \
     -d "$payload"
```

And remember to start the server with `--slot-save-path` set to the folder holding the cached files (in the example above, it would be `$HOME/.cache/llamacpp`). With this basic info you can build your own slot saving/restoring script, in Python or any other language, to match your needs.
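As a starting point for such a script, the two snippets above can be folded into a single bash function where the action (`save` or `restore`), filename, slot number, and server URL are parameters. This is only a sketch assuming the same endpoints as the scripts above; the default slot and URL are illustrative:

```shell
#!/bin/bash
# kv_slot_request: save or restore a server slot's KV cache via the /slots endpoint.
# Usage: kv_slot_request <save|restore> <filename> [slot] [server_url]
kv_slot_request() {
  local action="$1" filename="$2" slot="${3:-0}" server="${4:-http://localhost:10001}"

  # Reject anything other than the two supported actions before touching the network
  case "$action" in
    save|restore) ;;
    *) echo "action must be 'save' or 'restore'" >&2; return 1 ;;
  esac

  # JSON payload – safe quoting via printf, same as in the scripts above
  local payload
  payload=$(printf '{"filename":"%s"}' "$filename")

  curl -s -X POST "${server}/slots/${slot}?action=${action}" \
       -H 'Content-Type: application/json' \
       -d "$payload"
}
```

Example usage: `kv_slot_request save mycache.bin` after the long prompt has been processed, then `kv_slot_request restore mycache.bin` in a later session.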
I wanted to see what Nemotron-3-Nano-30B-A3B (1M context) could make of my entire codebase.
With a lot of trimming, I managed to fit everything into 950k tokens. But it would take several hours to process in llama.cpp.
I'm looking into llama.cpp's prompt caching features, so I could pay this huge processing cost only once and then reuse the result in future runs to ask any questions I want.
I see that llama-cli supports this feature perfectly: I could use `--prompt-cache entirecodebase.bin` the very first time, wait for it to process the 950k tokens, then close it when it's done. On subsequent runs I could do `--prompt-cache entirecodebase.bin --prompt-cache-ro`. However, llama-server does not seem to support anything like this. I would really prefer to use the web interface, especially since I want to let a novice I'm mentoring also ask it questions.
Do I have any other options here or am I stuck with llama-cli?
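For reference, the two llama-cli runs I have in mind would look roughly like this (the model path, context size, and prompt file are placeholders, not actual values from my setup):

```shell
# First run: process the 950k-token prompt once and persist the state to disk
llama-cli -m model.gguf -c 1000000 -f prompt.txt --prompt-cache entirecodebase.bin

# Later runs: reload the saved state read-only instead of reprocessing
llama-cli -m model.gguf -c 1000000 -f prompt.txt --prompt-cache entirecodebase.bin --prompt-cache-ro
```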