[DRAFT] RFC: Native Embeddings via a Shared Adapter (CLI ↔ Server), plus CI E2E Baseline #16957
SamMalayek started this conversation in Ideas

Replies: 1 comment
IN PROGRESS. I'll ping a few folks when ready for review.
IN PROGRESS. Working on other things; the timeline stretches over the coming weeks.
0) TL;DR
Add a minimal, opt-in native embeddings route in `llama-server` that uses the existing embedding compute API (`llama_embedding_compute()`), organized via the existing embedding TU at `src/llama-embedding.{h,cpp}`. Land a fast, deterministic E2E test suite that hardens shape/dtype/determinism and basic parallel safety. No default behavior changes in this phase.

1) Problem
`llama-server` exposes the OpenAI-compatible `/v1/embeddings`, but does not use first-party embedding kernels.

Goal: Ship a native server endpoint that uses the same underlying embedding compute API as the CLI/server implementation (`llama_embedding_compute()`), proven by a green, deterministic CI.

Non-goals (for this RFC/phase):
2) Approach
Use the existing C API entrypoint `llama_embedding_compute()` (already implemented) and standardize call sites (CLI + server) around the same low-level contract.

Both CLI and server:

- Create a `llama_context` with embeddings enabled.
- Call `llama_embedding_compute()` over token arrays.

Code lives in the actual repo location `src/llama-embedding.{h,cpp}` to keep embedding-related declarations/glue consolidated without introducing a new "run_embedding(...)" wrapper.
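Purely as a sketch of that shared contract: the context fields below are from the public API, while the per-prompt call shape in the trailing comment is an assumption, not the actual `llama_embedding_compute()` signature.

```cpp
#include "llama.h"

// Shared-contract sketch: both the CLI and the server route create a context with
// embeddings enabled once at startup, then run the embedding compute per prompt.
llama_context_params make_embedding_ctx_params_sketch() {
    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings = true;  // extract embeddings from this context
    cparams.n_ctx      = 512;   // fixed ctx size, configured at server/CLI start
    return cparams;
}

// Per prompt (assumed call shape only; the real declaration lives in include/llama.h):
//   tokenize -> llama_embedding_compute(ctx, tokens.data(), n_tokens, out.data(), embd_normalize);
```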
3) Phases & Acceptance

Phase 0 — Groundwork (no behavior change)

- Names/placement: `src/llama-embedding.{h,cpp}` (actual location).
- Contracts:
- Determinism envelope (for tests): `threads=1`, fixed ctx size/model configured at server start, JSON float32 serialization.
- Acceptance: Maintainer ACK on file location, naming, and request surface.
Phase 1 — CLI E2E Baseline (incomplete unless implemented in this effort)

- Determinism: identical outputs across repeated runs (`threads=1`); cosine sim ≥ 0.999 for `threads>1` (see the sketch below).
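Purely as an illustration of that acceptance check; the harness and capture mechanism are not specified here, and the helper names below are made up:

```cpp
#include <cassert>
#include <cmath>
#include <cstring>
#include <vector>

// Cosine similarity between two embedding vectors of equal dimension.
static double cosine_sim(const std::vector<float> & a, const std::vector<float> & b) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Phase 1 acceptance sketch: run_a/run_b are two embeddings of the same input,
// captured from two separate CLI invocations.
static void check_determinism(const std::vector<float> & run_a,
                              const std::vector<float> & run_b,
                              int threads) {
    assert(run_a.size() == run_b.size());
    if (threads == 1) {
        // threads=1: outputs must be bitwise-identical.
        assert(std::memcmp(run_a.data(), run_b.data(), run_a.size() * sizeof(float)) == 0);
    } else {
        // threads>1: allow tiny numerical drift, require cosine similarity >= 0.999.
        assert(cosine_sim(run_a, run_b) >= 0.999);
    }
}
```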
Phase 2 — Native Server Endpoint (opt-in)

- Route: `POST /v2/embeddings` behind the existing `--embeddings` flag (reused).
- Server context:
- Round-trip test: tiny model, one request, assert schema + dim/dtype (see the sketch after this list).
- Errors: use the existing server error system (`ERROR_TYPE_*`), no new EmbeddingStatus enum.
- Acceptance: CI job that enables `--embeddings` and passes; OpenAI `/v1/embeddings` untouched.
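Not the project's actual test harness; a sketch of the round-trip check, assuming a locally running `llama-server` started with `--embeddings` on an illustrative port, and using cpp-httplib plus nlohmann::json (include paths may differ from the vendored copies):

```cpp
#include "httplib.h"

#include <nlohmann/json.hpp>

#include <cassert>

using json = nlohmann::json;

int main() {
    // Assumes a llama-server started with --embeddings on localhost:8080 (illustrative).
    httplib::Client cli("localhost", 8080);

    const json req = {
        { "input", { "text a", "text b" } },
        { "embd_normalize", 2 },
    };

    auto res = cli.Post("/v2/embeddings", req.dump(), "application/json");
    assert(res && res->status == 200);

    const json body = json::parse(res->body);

    // Schema: "data" array with one entry per input, each carrying index + embedding.
    assert(body.contains("data") && body["data"].size() == 2);

    // Dim: every row matches the response's advertised "dim" field.
    const size_t dim = body["dim"].get<size_t>();
    for (const auto & row : body["data"]) {
        assert(row["embedding"].size() == dim);
        // Dtype: values deserialize as numbers (JSON float32 serialization).
        for (const auto & v : row["embedding"]) {
            assert(v.is_number());
        }
    }
    return 0;
}
```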
Phase 2a — Concurrency Smoke & Guardrails
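No details are pinned down for this phase yet. Purely as a sketch of the N={2,4} smoke referenced in the slicing plan (host, port, and helper names are assumptions):

```cpp
#include "httplib.h"

#include <nlohmann/json.hpp>

#include <cassert>
#include <string>
#include <thread>
#include <vector>

using json = nlohmann::json;

// Fire n_clients concurrent requests at /v2/embeddings and check each succeeds
// with a consistent embedding dimension.
static void concurrency_smoke_sketch(int n_clients) {
    std::vector<std::thread> workers;
    std::vector<int> dims(n_clients, -1);

    for (int i = 0; i < n_clients; ++i) {
        workers.emplace_back([i, &dims]() {
            httplib::Client cli("localhost", 8080);  // illustrative host/port
            const json req = {{ "input", "hello from client " + std::to_string(i) }};
            auto res = cli.Post("/v2/embeddings", req.dump(), "application/json");
            if (res && res->status == 200) {
                dims[i] = json::parse(res->body)["dim"].get<int>();
            }
        });
    }
    for (auto & t : workers) { t.join(); }

    for (int d : dims) {
        assert(d > 0 && d == dims[0]);  // all requests succeeded with the same dim
    }
}

int main() {
    for (int n : { 2, 4 }) {
        concurrency_smoke_sketch(n);
    }
    return 0;
}
```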
Phase 3 — Cross-Platform Smokes (still opt-in)

4) API Shape (minimal and explicit)
Route (opt-in): `POST /v2/embeddings`

Request
```json
{ "input": ["text a", "text b"], "threads": 2, "embd_normalize": 2 }
```

- `input` (required): string or array of strings (server to follow existing JSON conventions).
- `threads` (optional): per-request threading hint (bounded/validated).
- `embd_normalize` (optional): normalization mode passed to embedding compute; defaults to 2 (L2 / Euclidean).

Removed from request (come from server context / startup configuration):
- `model`
- `ctx_size`
- `seed`

Response
```json
{
  "data": [
    { "index": 0, "embedding": [/* float32... */] },
    { "index": 1, "embedding": [/* float32... */] }
  ],
  "model": "resolved-gguf",
  "dim": 384,
  "usage": { "prompt_tokens": 18 }
}
```

Error
```json
{ "error": { "type": "ERROR_TYPE_*", "message": "..." } }
```

5) Determinism Policy (tests)
- Bitwise-identical outputs are guaranteed only with `threads=1`.

6) Performance & Resource Budgets (guardrails)
7) Embedding Compute Interface and Call Sites
7.1 Compute API (existing)
Use the already-implemented API:
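The declaration is not reproduced in this draft. For orientation only, the call shape this RFC assumes, with illustrative argument names; the authoritative declaration is whatever `include/llama.h` ships:

```cpp
#include <cstdint>

// NOT the actual declaration: an assumed call shape for illustration only.
// See include/llama.h for the real llama_embedding_compute() signature.
// Assumed inputs: an embeddings-enabled context, the tokenized prompt, an output
// buffer of n_embd floats, and the embd_normalize mode (2 = L2 / Euclidean).
struct llama_context;  // opaque, from llama.h

int32_t llama_embedding_compute(
        struct llama_context * ctx,
        const int32_t        * tokens,
        int32_t                n_tokens,
        float                * embeddings_out,
        int32_t                embd_normalize);
```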
7.2 File location
Embedding-related declarations and thin glue live in:
- `src/llama-embedding.h`
- `src/llama-embedding.cpp`

7.3 CLI usage
- Call `llama_embedding_compute()` directly with the context and tokens.

7.4 Server usage
- Tokenize via `tokenize_input_prompts()`.
- Call `llama_embedding_compute()` (via the embedding TU if needed for linkage/organization).
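To make the flow concrete, a hedged sketch of what the `/v2/embeddings` handler body could look like; helper names other than `tokenize_input_prompts()` and the `ERROR_TYPE_*` strings, plus the compute call shape, are assumptions rather than the server's actual internals:

```cpp
#include <nlohmann/json.hpp>

#include <cstdint>
#include <string>
#include <vector>

using json = nlohmann::json;

// Sketch of the /v2/embeddings handler flow; not the server's actual code.
json handle_v2_embeddings_sketch(const json & request /*, server context omitted */) {
    // 1. Validate the request surface (input required; embd_normalize defaults to 2).
    if (!request.contains("input")) {
        return json{{ "error", {{ "type", "ERROR_TYPE_INVALID_REQUEST" },
                                { "message", "\"input\" is required" }} }};
    }
    const int32_t embd_normalize = request.value("embd_normalize", 2);
    (void) embd_normalize;  // passed to the compute call in the real handler

    // 2. Normalize "input" to an array of strings.
    std::vector<std::string> prompts;
    if (request["input"].is_string()) {
        prompts.push_back(request["input"].get<std::string>());
    } else {
        for (const auto & s : request["input"]) {
            prompts.push_back(s.get<std::string>());
        }
    }

    // 3. For each prompt: tokenize via tokenize_input_prompts(), then run
    //    llama_embedding_compute() over the tokens (assumed call shape) and
    //    serialize the float32 vector into the response row.
    json data = json::array();
    for (size_t i = 0; i < prompts.size(); ++i) {
        std::vector<float> embd;  // placeholder for the computed embedding
        data.push_back({{ "index", i }, { "embedding", embd }});
    }

    return json{{ "data", data }};
}
```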
8) Error Handling

Use the existing server error system (`ERROR_TYPE_*`) and existing HTTP mapping conventions. This RFC requires:

- Invalid request values (e.g. a bad `embd_normalize`) → existing "bad request" error type.
- A well-defined error when the route is called without the flag (`--embeddings`).

9) Observability (phase 1)
- Per-request logs include `embd_normalize`.
- Metrics: `embd_requests_total`, `embd_latency_ms{route="/v2/embeddings"}`, `embd_error_total{type}`

10) CI / Test Matrix
Functional
- Bitwise determinism (`threads=1`) ✓
- Cosine-similarity tolerance (`threads>1`) ✓

Platforms
Build flavors
11) Rollout & Compatibility
- `/v1/embeddings` is untouched.
- The native route stays opt-in behind `--embeddings`.
- OpenAI-compat stays at `/v1/...` and native at `/v2/...`, with release-note guidance.

12) Risks & Mitigations
- Regressions to the existing OpenAI-compat route: mitigated by leaving `/v1` untouched.
- Code sprawl: mitigated by confining changes to `src/llama-embedding.{h,cpp}` and call sites.

13) Slicing Plan (reviewer-friendly)
E2E: CLI Baseline (tests only)
– tiny models; dim/dtype/determinism/sim; ≤2m CI; no prod code.
– If deferred: track explicitly as follow-up.
Server Route (flagged-off)
– reuse `--embeddings`; `POST /v2/embeddings`; round-trip CI; logs/metrics.
– N={2,4} smoke; error-path tests; JSON schema tightening.
Cross-platform smokes
– non-blocking Windows/macOS verification; docs updates.
14) Appendix
14.1 Server config ownership
Model / ctx size / seed are server-owned startup configuration. Request only provides:
- `threads`
- `embd_normalize`

14.2 Reproducibility Notes
Exact reproducibility requires `threads=1` (and fixed server/CLI startup configuration).

Pointers to Relevant Code (for reviewers)
– CLI path: `examples/llama-embedding/...` (tokenize + compute + output)
– Embedding TU: `src/llama-embedding.{h,cpp}`
– C API: `include/llama.h` (`llama_embedding_compute`, contexts, pooling)
– Server: tokenize helper `tokenize_input_prompts()`, route registration, error plumbing (`ERROR_TYPE_*`)