[DRAFT] RFC: Native Embeddings via a Shared Adapter (CLI ↔ Server), plus CI E2E Baseline #16957
SamMalayek started this conversation in Ideas

Replies: 1 comment
IN PROGRESS. I'll ping a few folks when ready for review.
IN PROGRESS. Working on other things; the timeline stretches over the coming weeks.
0) TL;DR
Add a minimal, opt-in native embeddings route in `llama-server` that uses the existing embedding compute API (`llama_embedding_compute()`), organized via the existing embedding TU at `src/llama-embedding.{h,cpp}`. Land a fast, deterministic E2E test suite that hardens shape/dtype/determinism and basic parallel safety. No default behavior changes in this phase.

1) Problem
`llama-server` exposes the OpenAI-compatible `/v1/embeddings`, but does not use first-party embedding kernels.

Goal: Ship a native server endpoint that uses the same underlying embedding compute API as the CLI/server implementation (`llama_embedding_compute()`), proven by a green, deterministic CI.

Non-goals (for this RFC/phase):
2) Approach
Use the existing C API entrypoint `llama_embedding_compute()` (already implemented) and standardize call sites (CLI + server) around the same low-level contract.

Both CLI and server:

- Create a `llama_context` with embeddings enabled.
- Call `llama_embedding_compute()` over token arrays.

Code lives in the actual repo location `src/llama-embedding.{h,cpp}` to keep embedding-related declarations/glue consolidated without introducing a new "run_embedding(...)" wrapper.
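Purely as a sketch of that shared contract: the context fields below are from the public API, while the per-prompt call shape in the trailing comment is an assumption, not the actual `llama_embedding_compute()` signature.

```cpp
#include "llama.h"

// Shared-contract sketch: both the CLI and the server route create a context with
// embeddings enabled once at startup, then run the embedding compute per prompt.
llama_context_params make_embedding_ctx_params_sketch() {
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings = true;  // extract embeddings from this context
    cparams.n_ctx      = 512;   // fixed ctx size, configured at server/CLI start
    return cparams;
}

// Per prompt (assumed call shape only; the real declaration lives in include/llama.h):
//   tokenize -> llama_embedding_compute(ctx, tokens.data(), n_tokens, out.data(), embd_normalize);
```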
3) Phases & Acceptance

Phase 0 — Groundwork (no behavior change)

- Names/placement: `src/llama-embedding.{h,cpp}` (actual location).
- Contracts:
- Determinism envelope (for tests): `threads=1`, fixed ctx size/model configured at server start, JSON float32 serialization.
- Acceptance: Maintainer ACK on file location, naming, and request surface.
Phase 1 — CLI E2E Baseline (incomplete unless implemented in this effort)

- Determinism: identical outputs across repeated runs (`threads=1`); cosine sim ≥ 0.999 for `threads>1` (see the sketch below).
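Purely as an illustration of that acceptance check; the harness and capture mechanism are not specified here, and the helper names below are made up:

```cpp
#include <cassert>
#include <cmath>
#include <cstring>
#include <vector>

// Cosine similarity between two embedding vectors of equal dimension.
static double cosine_sim(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Phase 1 acceptance sketch: run_a/run_b are two embeddings of the same input,
// captured from two separate CLI invocations.
static void check_determinism(const std::vector<float> & run_a,
                              const std::vector<float> & run_b,
                              int threads) {
    assert(run_a.size() == run_b.size());
    if (threads == 1) {
        // threads=1: outputs must be bitwise-identical.
        assert(std::memcmp(run_a.data(), run_b.data(), run_a.size() * sizeof(float)) == 0);
    } else {
        // threads>1: allow tiny numerical drift, require cosine similarity >= 0.999.
        assert(cosine_sim(run_a, run_b) >= 0.999);
    }
}
```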
Phase 2 — Native Server Endpoint (opt-in)

- Route: `POST /v2/embeddings` behind the existing `--embeddings` flag (reused).
- Server context:
- Round-trip test: tiny model, one request, assert schema + dim/dtype (see the sketch after this list).
- Errors: use the existing server error system (`ERROR_TYPE_*`), no new EmbeddingStatus enum.
- Acceptance: CI job that enables `--embeddings` and passes; OpenAI `/v1/embeddings` untouched.
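Not the project's actual test harness; a sketch of the round-trip check, assuming a locally running `llama-server` started with `--embeddings` on an illustrative port, and using cpp-httplib plus nlohmann::json (include paths may differ from the vendored copies):

```cpp
#include "httplib.h"

#include <nlohmann/json.hpp>

#include <cassert>

using json = nlohmann::json;

int main() {
    // Assumes a llama-server started with --embeddings on localhost:8080 (illustrative).
    httplib::Client cli("localhost", 8080);

    const json req = {
        { "input", { "text a", "text b" } },
        { "embd_normalize", 2 },
    };

    auto res = cli.Post("/v2/embeddings", req.dump(), "application/json");
    assert(res && res->status == 200);

    const json body = json::parse(res->body);

    // Schema: "data" array with one entry per input, each carrying index + embedding.
    assert(body.contains("data") && body["data"].size() == 2);

    // Dim: every row matches the response's advertised "dim" field.
    const size_t dim = body["dim"].get<size_t>();
    for (const auto & row : body["data"]) {
        assert(row["embedding"].size() == dim);
        // Dtype: values deserialize as numbers (JSON float32 serialization).
        for (const auto & v : row["embedding"]) {
            assert(v.is_number());
        }
    }
    return 0;
}
```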
Phase 2a — Concurrency Smoke & Guardrails
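No details are pinned down for this phase yet. Purely as a sketch of the N={2,4} smoke referenced in the slicing plan (host, port, and helper names are assumptions):

```cpp
#include "httplib.h"

#include <nlohmann/json.hpp>

#include <cassert>
#include <string>
#include <thread>
#include <vector>

using json = nlohmann::json;

// Fire n_clients concurrent requests at /v2/embeddings and check each succeeds
// with a consistent embedding dimension.
static void concurrency_smoke_sketch(int n_clients) {
    std::vector<std::thread> workers;
    std::vector<int> dims(n_clients, -1);

    for (int i = 0; i < n_clients; ++i) {
        workers.emplace_back([i, &dims]() {
            httplib::Client cli("localhost", 8080);  // illustrative host/port
            const json req = {{ "input", "hello from client " + std::to_string(i) }};
            auto res = cli.Post("/v2/embeddings", req.dump(), "application/json");
            if (res && res->status == 200) {
                dims[i] = json::parse(res->body)["dim"].get<int>();
            }
        });
    }
    for (auto & t : workers) { t.join(); }

    for (int d : dims) {
        assert(d > 0 && d == dims[0]);  // all requests succeeded with the same dim
    }
}

int main() {
    for (int n : { 2, 4 }) {
        concurrency_smoke_sketch(n);
    }
    return 0;
}
```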
Phase 3 — Cross-Platform Smokes (still opt-in)

4) API Shape (minimal and explicit)
Route (opt-in): `POST /v2/embeddings`

Request
```json
{ "input": ["text a", "text b"], "threads": 2, "embd_normalize": 2 }
```

- `input` (required): string or array of strings (server to follow existing JSON conventions).
- `threads` (optional): per-request threading hint (bounded/validated).
- `embd_normalize` (optional): normalization mode passed to embedding compute; defaults to 2 (L2 / Euclidean).

Removed from request (come from server context / startup configuration):
- `model`
- `ctx_size`
- `seed`

Response
```json
{
  "data": [
    { "index": 0, "embedding": [/* float32... */] },
    { "index": 1, "embedding": [/* float32... */] }
  ],
  "model": "resolved-gguf",
  "dim": 384,
  "usage": { "prompt_tokens": 18 }
}
```

Error
```json
{ "error": { "type": "ERROR_TYPE_*", "message": "..." } }
```

5) Determinism Policy (tests)
- Bitwise-identical outputs are guaranteed only with `threads=1`.

6) Performance & Resource Budgets (guardrails)
7) Embedding Compute Interface and Call Sites
7.1 Compute API (existing)
Use the already-implemented API:
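The declaration is not reproduced in this draft. For orientation only, the call shape this RFC assumes, with illustrative argument names; the authoritative declaration is whatever `include/llama.h` ships:

```cpp
#include <cstdint>

// NOT the actual declaration: an assumed call shape for illustration only.
// See include/llama.h for the real llama_embedding_compute() signature.
// Assumed inputs: an embeddings-enabled context, the tokenized prompt, an output
// buffer of n_embd floats, and the embd_normalize mode (2 = L2 / Euclidean).
struct llama_context;  // opaque, from llama.h

int32_t llama_embedding_compute(
        struct llama_context * ctx,
        const int32_t        * tokens,
        int32_t                n_tokens,
        float                * embeddings_out,
        int32_t                embd_normalize);
```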
7.2 File location
Embedding-related declarations and thin glue live in:
- `src/llama-embedding.h`
- `src/llama-embedding.cpp`

7.3 CLI usage
- Call `llama_embedding_compute()` directly with the context and tokens.

7.4 Server usage
- Tokenize via `tokenize_input_prompts()`.
- Call `llama_embedding_compute()` (via the embedding TU if needed for linkage/organization).
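To make the flow concrete, a hedged sketch of what the `/v2/embeddings` handler body could look like; helper names other than `tokenize_input_prompts()` and the `ERROR_TYPE_*` strings, plus the compute call shape, are assumptions rather than the server's actual internals:

```cpp
#include <nlohmann/json.hpp>

#include <cstdint>
#include <string>
#include <vector>

using json = nlohmann::json;

// Sketch of the /v2/embeddings handler flow; not the server's actual code.
json handle_v2_embeddings_sketch(const json & request /*, server context omitted */) {
    // 1. Validate the request surface (input required; embd_normalize defaults to 2).
    if (!request.contains("input")) {
        return json{{ "error", {{ "type", "ERROR_TYPE_INVALID_REQUEST" },
                                { "message", "\"input\" is required" }} }};
    }
    const int32_t embd_normalize = request.value("embd_normalize", 2);
    (void) embd_normalize;  // passed to the compute call in the real handler

    // 2. Normalize "input" to an array of strings.
    std::vector<std::string> prompts;
    if (request["input"].is_string()) {
        prompts.push_back(request["input"].get<std::string>());
    } else {
        for (const auto & s : request["input"]) {
            prompts.push_back(s.get<std::string>());
        }
    }

    // 3. For each prompt: tokenize via tokenize_input_prompts(), then run
    //    llama_embedding_compute() over the tokens (assumed call shape) and
    //    serialize the float32 vector into the response row.
    json data = json::array();
    for (size_t i = 0; i < prompts.size(); ++i) {
        std::vector<float> embd;  // placeholder for the computed embedding
        data.push_back({{ "index", i }, { "embedding", embd }});
    }

    return json{{ "data", data }};
}
```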
8) Error Handling

Use the existing server error system (`ERROR_TYPE_*`) and existing HTTP mapping conventions. This RFC requires:

- Invalid request values (e.g. a bad `embd_normalize`) → existing "bad request" error type.
- A well-defined error when the route is called without the flag (`--embeddings`).

9) Observability (phase 1)
- Per-request logs include `embd_normalize`.
- Metrics: `embd_requests_total`, `embd_latency_ms{route="/v2/embeddings"}`, `embd_error_total{type}`

10) CI / Test Matrix
Functional
- Bitwise determinism (`threads=1`) ✓
- Cosine-similarity tolerance (`threads>1`) ✓

Platforms
Build flavors
11) Rollout & Compatibility
- `/v1/embeddings` is untouched.
- The native route stays opt-in behind `--embeddings`.
- OpenAI-compat stays at `/v1/...` and native at `/v2/...`, with release-note guidance.

12) Risks & Mitigations
- Regressions to the existing OpenAI-compat route: mitigated by leaving `/v1` untouched.
- Code sprawl: mitigated by confining changes to `src/llama-embedding.{h,cpp}` and call sites.

13) Slicing Plan (reviewer-friendly)
E2E: CLI Baseline (tests only)
– tiny models; dim/dtype/determinism/sim; ≤2m CI; no prod code.
– If deferred: track explicitly as follow-up.
Server Route (flagged-off)
– reuse `--embeddings`; `POST /v2/embeddings`; round-trip CI; logs/metrics.
– N={2,4} smoke; error-path tests; JSON schema tightening.
Cross-platform smokes
– non-blocking Windows/macOS verification; docs updates.
14) Appendix
14.1 Server config ownership
Model / ctx size / seed are server-owned startup configuration. Request only provides:
- `threads`
- `embd_normalize`

14.2 Reproducibility Notes
Exact reproducibility requires `threads=1` (and fixed server/CLI startup configuration).

Pointers to Relevant Code (for reviewers)
– CLI path: `examples/llama-embedding/...` (tokenize + compute + output)
– Embedding TU: `src/llama-embedding.{h,cpp}`
– C API: `include/llama.h` (`llama_embedding_compute`, contexts, pooling)
– Server: tokenize helper `tokenize_input_prompts()`, route registration, error plumbing (`ERROR_TYPE_*`)