1 change: 1 addition & 0 deletions .github/ISSUE_TEMPLATE/019-bug-misc.yml
@@ -44,6 +44,7 @@ body:
 - Documentation/Github
 - libllama (core library)
 - llama-cli
+- llama-completion
 - llama-server
 - llama-bench
 - llama-quantize
3 changes: 2 additions & 1 deletion .github/copilot-instructions.md
@@ -183,7 +183,8 @@ Add `ggml-ci` to commit message to trigger heavy CI workloads on the custom CI i

 ### Built Executables (in `build/bin/`)
 Primary tools:
-- **`llama-cli`**: Main inference tool
+- **`llama-cli`**: Main CLI tool
+- **`llama-completion`**: Text completion tool
 - **`llama-server`**: OpenAI-compatible HTTP server
 - **`llama-quantize`**: Model quantization utility
 - **`llama-perplexity`**: Model evaluation tool
2 changes: 1 addition & 1 deletion docs/backend/SYCL.md
@@ -116,7 +116,7 @@ SYCL backend supports Intel GPU Family:
 *Notes:*

 - **Memory**
-  - The device memory is a limitation when running a large model. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/llama-cli`.
+  - The device memory is a limitation when running a large model. The loaded model size, *`llm_load_tensors: buffer_size`*, is displayed in the log when running `./bin/llama-completion`.
   - Please make sure the GPU shared memory from the host is large enough to account for the model's size. For e.g. the *llama-2-7b.Q4_0* requires at least 8.0GB for integrated GPU and 4.0GB for discrete GPU.

 - **Execution Unit (EU)**
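The memory note in the hunk above boils down to a simple check: load the model once, then compare the reported `llm_load_tensors: buffer_size` against the memory available on the device. A minimal sketch of automating that check, assuming a local GGUF path and that `llama-completion` accepts the same `-m`, `-n`, `-ngl`, and `-p` flags used elsewhere in this PR:

```python
# Sketch only: run llama-completion once and surface the buffer-size lines from the
# load log. The binary path, model path, and exact flag set are assumptions, not
# taken from SYCL.md itself.
import subprocess

def print_load_buffer_sizes(model_path: str,
                            binary: str = "./build/bin/llama-completion") -> None:
    # Generate a single token so the model is fully loaded, then exit.
    proc = subprocess.run(
        [binary, "-m", model_path, "-ngl", "99", "-n", "1", "-p", "hi"],
        capture_output=True, text=True,
    )
    # llama.cpp writes its load log to stderr; keep only the buffer-size lines.
    for line in proc.stderr.splitlines():
        if "buffer" in line and "size" in line:
            print(line)

if __name__ == "__main__":
    print_load_buffer_sizes("models/llama-2-7b.Q4_0.gguf")
```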
1 change: 1 addition & 0 deletions docs/backend/hexagon/README.md
@@ -62,6 +62,7 @@ To generate an installable "package" simply use cmake --install:
 ...
 -- Installing: /workspace/pkg-adb/llama.cpp/bin/llama-bench
 -- Installing: /workspace/pkg-adb/llama.cpp/bin/llama-cli
+-- Installing: /workspace/pkg-adb/llama.cpp/bin/llama-completion
 ...
 ```

2 changes: 1 addition & 1 deletion docs/backend/hexagon/developer.md
@@ -53,7 +53,7 @@ M=gpt-oss-20b-Q4_0.gguf NDEV=4 D=HTP0,HTP1,HTP2,HTP3 P=surfing.txt scripts/snapd
 ...
 LD_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
 ADSP_LIBRARY_PATH=/data/local/tmp/llama.cpp/lib
-GGML_HEXAGON_NDEV=4 ./bin/llama-cli --no-mmap -m /data/local/tmp/llama.cpp/../gguf/gpt-oss-20b-Q4_0.gguf
+GGML_HEXAGON_NDEV=4 ./bin/llama-completion --no-mmap -m /data/local/tmp/llama.cpp/../gguf/gpt-oss-20b-Q4_0.gguf
    -t 4 --ctx-size 8192 --batch-size 128 -ctk q8_0 -ctv q8_0 -fa on -ngl 99 --device HTP0,HTP1,HTP2,HTP3 -no-cnv -f surfing.txt
 ...
 llama_model_loader: - type f32: 289 tensors
8 changes: 4 additions & 4 deletions scripts/fetch_server_test_models.py
@@ -74,21 +74,21 @@ def collect_hf_model_test_parameters(test_file) -> Generator[HuggingFaceModel, N
 for m in models:
     logging.info(f' - {m.hf_repo} / {m.hf_file}')

-cli_path = os.environ.get(
+completion_path = os.environ.get(
     'LLAMA_CLI_BIN_PATH',
     os.path.join(
         os.path.dirname(__file__),
-        '../build/bin/Release/llama-cli.exe' if os.name == 'nt' else '../build/bin/llama-cli'))
+        '../build/bin/Release/llama-completion.exe' if os.name == 'nt' else '../build/bin/llama-completion'))

 for m in models:
     if '<' in m.hf_repo or (m.hf_file is not None and '<' in m.hf_file):
         continue
     if m.hf_file is not None and '-of-' in m.hf_file:
         logging.warning(f'Skipping model at {m.hf_repo} / {m.hf_file} because it is a split file')
         continue
-    logging.info(f'Using llama-cli to ensure model {m.hf_repo}/{m.hf_file} was fetched')
+    logging.info(f'Using llama-completion to ensure model {m.hf_repo}/{m.hf_file} was fetched')
     cmd = [
-        cli_path,
+        completion_path,
         '-hfr', m.hf_repo,
         *([] if m.hf_file is None else ['-hff', m.hf_file]),
         '-n', '1',
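For reference outside the diff, the pattern this script uses is: resolve the binary from `LLAMA_CLI_BIN_PATH` (falling back to the build tree), then run `llama-completion` with `-hfr`/`-hff` and `-n 1` so the Hugging Face model is downloaded into the local cache as a side effect. A minimal standalone sketch of that pattern; the `subprocess` call, the example repo/file, and anything beyond the flags visible in the hunk above are assumptions, not part of the real script:

```python
# Sketch of the fetch pattern shown above: run llama-completion for one token so the
# model is pulled into the local cache. Only -hfr/-hff/-n come from the visible diff;
# the subprocess call and the example repo/file below are illustrative assumptions.
import os
import subprocess

def fetch_model(hf_repo: str, hf_file: str | None = None) -> None:
    completion_path = os.environ.get(
        'LLAMA_CLI_BIN_PATH',
        os.path.join(
            os.path.dirname(__file__),
            '../build/bin/Release/llama-completion.exe' if os.name == 'nt' else '../build/bin/llama-completion'))
    cmd = [
        completion_path,
        '-hfr', hf_repo,
        *([] if hf_file is None else ['-hff', hf_file]),
        '-n', '1',  # a single token is enough to force the download
    ]
    subprocess.check_call(cmd)

if __name__ == '__main__':
    fetch_model('example-org/example-model-GGUF', 'example-model-Q4_K_M.gguf')  # hypothetical repo/file
```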