tools/server/README.md (8 additions, 8 deletions)
@@ -75,9 +75,9 @@ For the ful list of features, please refer to [server's changelog](https://githu
 |`--numa TYPE`| attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggml-org/llama.cpp/issues/1437<br/>(env: LLAMA_ARG_NUMA) |
 |`-dev, --device <dev1,dev2,..>`| comma-separated list of devices to use for offloading (none = don't offload)<br/>use --list-devices to see a list of available devices<br/>(env: LLAMA_ARG_DEVICE) |
 |`--list-devices`| print list of available devices and exit |
-|`--override-tensor, -ot <tensor name pattern>=<buffer type>,...`| override tensor buffer type |
-|`--cpu-moe, -cmoe`| keep all Mixture of Experts (MoE) weights in the CPU<br/>(env: LLAMA_ARG_CPU_MOE) |
-|`--n-cpu-moe, -ncmoe N`| keep the Mixture of Experts (MoE) weights of the first N layers in the CPU<br/>(env: LLAMA_ARG_N_CPU_MOE) |
+|`-ot, --override-tensor <tensor name pattern>=<buffer type>,...`| override tensor buffer type |
+|`-cmoe, --cpu-moe`| keep all Mixture of Experts (MoE) weights in the CPU<br/>(env: LLAMA_ARG_CPU_MOE) |
+|`-ncmoe, --n-cpu-moe N`| keep the Mixture of Experts (MoE) weights of the first N layers in the CPU<br/>(env: LLAMA_ARG_N_CPU_MOE) |
 |`-ngl, --gpu-layers, --n-gpu-layers N`| max. number of layers to store in VRAM (default: -1)<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
 |`-sm, --split-mode {none,layer,row}`| how to split the model across multiple GPUs, one of:<br/>- none: use one GPU only<br/>- layer (default): split layers and KV across GPUs<br/>- row: split rows across GPUs<br/>(env: LLAMA_ARG_SPLIT_MODE) |
 |`-ts, --tensor-split N0,N1,N2,...`| fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1<br/>(env: LLAMA_ARG_TENSOR_SPLIT) |
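To illustrate how the offload flags touched in this hunk are typically combined, a minimal sketch (the model path, layer count, tensor-name pattern, and buffer-type name are placeholders, not taken from this diff):

```sh
# Offload all layers to the GPU but keep MoE expert weights on the CPU
llama-server -m model.gguf -ngl 99 --cpu-moe

# Same idea, but only the experts of the first 10 layers stay on the CPU
llama-server -m model.gguf -ngl 99 --n-cpu-moe 10

# Manual variant via --override-tensor: a tensor-name pattern mapped to a buffer type
# ("exps" and "CPU" are illustrative values for <tensor name pattern>=<buffer type>)
llama-server -m model.gguf -ngl 99 -ot "exps=CPU"
```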
@@ -156,8 +156,8 @@ For the ful list of features, please refer to [server's changelog](https://githu
 | Argument | Explanation |
 | -------- | ----------- |
 |`--ctx-checkpoints, --swa-checkpoints N`| max number of context checkpoints to create per slot (default: 8)[(more info)](https://github.com/ggml-org/llama.cpp/pull/15293)<br/>(env: LLAMA_ARG_CTX_CHECKPOINTS) |
-|`--cache-ram, -cram N`| set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 - disable)[(more info)](https://github.com/ggml-org/llama.cpp/pull/16391)<br/>(env: LLAMA_ARG_CACHE_RAM) |
-|`--kv-unified, -kvu`| use single unified KV buffer shared across all sequences (default: enabled if number of slots is auto)<br/>(env: LLAMA_ARG_KV_UNIFIED) |
+|`-cram, --cache-ram N`| set the maximum cache size in MiB (default: 8192, -1 - no limit, 0 - disable)[(more info)](https://github.com/ggml-org/llama.cpp/pull/16391)<br/>(env: LLAMA_ARG_CACHE_RAM) |
+|`-kvu, --kv-unified`| use single unified KV buffer shared across all sequences (default: enabled if number of slots is auto)<br/>(env: LLAMA_ARG_KV_UNIFIED) |
 |`--context-shift, --no-context-shift`| whether to use context shift on infinite text generation (default: disabled)<br/>(env: LLAMA_ARG_CONTEXT_SHIFT) |
 |`-r, --reverse-prompt PROMPT`| halt generation at PROMPT, return control in interactive mode<br/> |
 |`-sp, --special`| special tokens output enabled (default: false) |
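A brief sketch of how the cache flags from this hunk might be used together (the model path and the cache size are placeholder values):

```sh
# Cap the server-side prompt cache at 2048 MiB and share one unified KV buffer
llama-server -m model.gguf --cache-ram 2048 --kv-unified

# Disable the RAM cache entirely (0); -1 would remove the limit instead
llama-server -m model.gguf --cache-ram 0
```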
@@ -172,9 +172,9 @@ For the ful list of features, please refer to [server's changelog](https://githu
 |`--mmproj-offload, --no-mmproj-offload`| whether to enable GPU offloading for multimodal projector (default: enabled)<br/>(env: LLAMA_ARG_MMPROJ_OFFLOAD) |
 |`--image-min-tokens N`| minimum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MIN_TOKENS) |
 |`--image-max-tokens N`| maximum number of tokens each image can take, only used by vision models with dynamic resolution (default: read from model)<br/>(env: LLAMA_ARG_IMAGE_MAX_TOKENS) |
-|`--override-tensor-draft, -otd <tensor name pattern>=<buffer type>,...`| override tensor buffer type for draft model |
-|`--cpu-moe-draft, -cmoed`| keep all Mixture of Experts (MoE) weights in the CPU for the draft model<br/>(env: LLAMA_ARG_CPU_MOE_DRAFT) |
-|`--n-cpu-moe-draft, -ncmoed N`| keep the Mixture of Experts (MoE) weights of the first N layers in the CPU for the draft model<br/>(env: LLAMA_ARG_N_CPU_MOE_DRAFT) |
+|`-otd, --override-tensor-draft <tensor name pattern>=<buffer type>,...`| override tensor buffer type for draft model |
+|`-cmoed, --cpu-moe-draft`| keep all Mixture of Experts (MoE) weights in the CPU for the draft model<br/>(env: LLAMA_ARG_CPU_MOE_DRAFT) |
+|`-ncmoed, --n-cpu-moe-draft N`| keep the Mixture of Experts (MoE) weights of the first N layers in the CPU for the draft model<br/>(env: LLAMA_ARG_N_CPU_MOE_DRAFT) |
 |`-a, --alias STRING`| set alias for model name (to be used by REST API)<br/>(env: LLAMA_ARG_ALIAS) |
 |`--host HOST`| ip address to listen, or bind to an UNIX socket if the address ends with .sock (default: 127.0.0.1)<br/>(env: LLAMA_ARG_HOST) |
 |`--port PORT`| port to listen (default: 8080)<br/>(env: LLAMA_ARG_PORT) |
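A sketch combining the draft-model and network flags from this hunk, assuming a speculative-decoding setup where `-md` selects the draft model (that flag and both model paths are assumptions, not part of this diff):

```sh
# Main model fully offloaded; the draft model's MoE expert weights stay on the CPU.
# Alias, host, and port configure how the REST API is exposed.
llama-server -m main.gguf -md draft.gguf -ngl 99 --cpu-moe-draft \
  -a my-model --host 0.0.0.0 --port 8080
```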