diff --git a/openspec/changes/add-langfuse-export/design.md b/openspec/changes/add-langfuse-export/design.md new file mode 100644 index 00000000..0cca5b97 --- /dev/null +++ b/openspec/changes/add-langfuse-export/design.md @@ -0,0 +1,199 @@ +# Design: Langfuse Export for Observability + +## Context + +AgentV produces `output_messages` arrays containing tool calls, assistant responses, and timestamps during evaluation runs. This data is valuable for debugging and monitoring but currently stays within AgentV's result files. + +Industry frameworks (LangWatch, Mastra, Google ADK, Azure SDK) have adopted OpenTelemetry semantic conventions for LLM observability. Langfuse is an open-source platform that accepts traces in a compatible format. + +**Stakeholders:** +- AgentV users who need to debug agent behavior +- Teams integrating AgentV into existing LLMOps workflows +- Developers comparing agent configurations across runs + +## Goals / Non-Goals + +**Goals:** +- Export `output_messages` to Langfuse as structured traces +- Follow OpenTelemetry GenAI semantic conventions where applicable +- Provide opt-in content capture for privacy-sensitive environments +- Keep export logic decoupled from core evaluation flow + +**Non-Goals:** +- Full OpenTelemetry SDK integration (deferred) +- Real-time streaming of traces during execution +- Bi-directional sync with Langfuse (import traces) +- Support for other observability platforms in this change (extensible design only) + +## Decisions + +### Decision 1: Use Langfuse SDK directly (not OTEL SDK) + +**What:** Import `langfuse` npm package and use its native trace/span API. + +**Why:** +- Langfuse SDK handles authentication, batching, and flush automatically +- Avoids complexity of OTEL collector setup +- Direct mapping to Langfuse concepts (traces, generations, spans) +- Can add OTEL exporter later as separate capability + +**Alternatives considered:** +- Full OTEL SDK + OTLP exporter: More portable but requires collector infrastructure +- Custom HTTP calls: Fragile, no batching, reinvents SDK features + +### Decision 2: Map OutputMessage to Langfuse structure + +**Mapping:** + +| AgentV Concept | Langfuse Concept | Notes | +|----------------|------------------|-------| +| Evaluation run | Trace | One trace per eval case | +| `eval_id` | `trace.name` | Identifies the test case | +| `target` | `trace.metadata.target` | Which provider was used | +| Assistant message with content | Generation | LLM response | +| Tool call | Span (type: "tool") | Individual tool invocation | +| `score` | Score | Attached to trace | + +**Langfuse Trace Structure:** +``` +Trace: eval_id="case-001" +├── Generation: "assistant response" +│ ├── input: [user messages] +│ ├── output: "response text" +│ └── usage: { input_tokens, output_tokens } +├── Span: tool="search" (type: tool) +│ ├── input: { query: "..." } +│ └── output: "results..." +├── Span: tool="read_file" (type: tool) +│ └── ... 
+└── Score: name="eval_score", value=0.85 +``` + +### Decision 3: Attribute naming follows GenAI conventions + +Use `gen_ai.*` prefixed attributes where applicable: + +```typescript +// Generation attributes +'gen_ai.request.model': target.model, +'gen_ai.usage.input_tokens': usage?.input_tokens, +'gen_ai.usage.output_tokens': usage?.output_tokens, + +// Tool span attributes +'gen_ai.tool.name': toolCall.tool, +'gen_ai.tool.call.id': toolCall.id, + +// Trace metadata +'agentv.eval_id': evalCase.id, +'agentv.target': target.name, +'agentv.dataset': evalCase.dataset, +``` + +### Decision 4: Privacy-first content capture + +**Default:** Do not capture message content or tool inputs/outputs. + +**Opt-in:** Set `LANGFUSE_CAPTURE_CONTENT=true` to include: +- User message content +- Assistant response content +- Tool call inputs and outputs + +**Rationale:** Traces may contain PII, secrets, or proprietary data. Following Azure SDK and Google ADK patterns of opt-in content capture. + +### Decision 5: Flush strategy + +**Approach:** Flush traces after each eval case completes (not batched across cases). + +**Why:** +- Ensures traces are visible in Langfuse promptly +- Avoids data loss if process crashes +- Trade-off: Slightly higher network overhead (acceptable for eval workloads) + +**Configuration:** No user-facing config in v1. Can add `--langfuse-batch` later if needed. + +## Data Flow + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ agentv run │ +│ │ +│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ │ +│ │ Provider │───▶│ Orchestrator │───▶│ EvaluationResult │ │ +│ │ Response │ │ │ │ + outputMessages │ │ +│ └──────────────┘ └──────────────┘ └────────┬─────────┘ │ +│ │ │ +│ ┌────────────────────────▼──────┐ │ +│ │ LangfuseExporter │ │ +│ │ (if --langfuse enabled) │ │ +│ └────────────────┬──────────────┘ │ +│ │ │ +└────────────────────────────────────────────┼────────────────────┘ + │ + ▼ + ┌─────────────────┐ + │ Langfuse │ + │ Platform │ + └─────────────────┘ +``` + +## API Surface + +### CLI + +```bash +# Enable Langfuse export +agentv run eval.yaml --langfuse + +# With custom host (self-hosted Langfuse) +LANGFUSE_HOST=https://langfuse.mycompany.com agentv run eval.yaml --langfuse + +# With content capture +LANGFUSE_CAPTURE_CONTENT=true agentv run eval.yaml --langfuse +``` + +### Environment Variables + +| Variable | Required | Description | +|----------|----------|-------------| +| `LANGFUSE_PUBLIC_KEY` | Yes (if --langfuse) | Langfuse project public key | +| `LANGFUSE_SECRET_KEY` | Yes (if --langfuse) | Langfuse project secret key | +| `LANGFUSE_HOST` | No | Custom Langfuse host (default: cloud) | +| `LANGFUSE_CAPTURE_CONTENT` | No | Enable content capture (default: false) | + +### Programmatic API + +```typescript +import { LangfuseExporter } from '@agentv/core/observability'; + +const exporter = new LangfuseExporter({ + publicKey: process.env.LANGFUSE_PUBLIC_KEY, + secretKey: process.env.LANGFUSE_SECRET_KEY, + host: process.env.LANGFUSE_HOST, + captureContent: process.env.LANGFUSE_CAPTURE_CONTENT === 'true', +}); + +// Export a single result +await exporter.export(evaluationResult, outputMessages); + +// Flush pending traces +await exporter.flush(); +``` + +## Risks / Trade-offs + +| Risk | Mitigation | +|------|------------| +| Langfuse SDK version churn | Pin to stable version, document upgrade path | +| Network failures during export | Log warning, don't fail evaluation; traces are optional | +| Large traces with many tool calls | Langfuse handles 
batching internally; monitor payload sizes | +| Content capture leaking secrets | Default to off; document clearly in CLI help | + +## Migration Plan + +**No migration required.** This is a new optional feature. Existing users are unaffected unless they enable `--langfuse`. + +## Open Questions + +1. Should we support `--langfuse-session-id` to group multiple eval runs? (Defer to user feedback) +2. Should token usage be estimated if provider doesn't return it? (Defer - not all providers report usage) +3. Should we add a `--dry-run-langfuse` to preview traces without sending? (Nice to have, not v1) diff --git a/openspec/changes/add-langfuse-export/proposal.md b/openspec/changes/add-langfuse-export/proposal.md new file mode 100644 index 00000000..93cde9e6 --- /dev/null +++ b/openspec/changes/add-langfuse-export/proposal.md @@ -0,0 +1,32 @@ +# Change: Add Langfuse Export for Observability + +## Why + +AgentV captures rich execution traces via `output_messages` (tool calls, assistant responses, timestamps) but has no way to export this data to observability platforms. Users need to debug agent behavior, monitor performance, and integrate with existing LLMOps tooling. + +Langfuse is an open-source LLM observability platform that supports OpenTelemetry-compatible trace ingestion. By exporting AgentV traces to Langfuse, users can: +- Visualize agent execution flows +- Debug tool call sequences +- Track token usage and latency across evaluations +- Compare agent behavior across different configurations + +## What Changes + +- **Add `langfuse` export option**: Convert `output_messages` to OpenTelemetry-compatible spans and send to Langfuse + - New `--langfuse` CLI flag enables export during `agentv run` + - Supports `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, `LANGFUSE_HOST` environment variables + - Maps `OutputMessage` and `ToolCall` to Langfuse trace/span format + - Uses `gen_ai.*` semantic conventions for LLM attributes + - Optional content capture controlled by `LANGFUSE_CAPTURE_CONTENT` (default: false for privacy) + +- **Add new `observability` capability spec**: Defines trace export behavior and provider contracts + +## Impact + +- Affected specs: New `observability` capability (does not modify existing specs) +- Affected code: + - `packages/core/src/observability/` (new directory) + - `packages/core/src/observability/langfuse-exporter.ts` (new file) + - `packages/core/src/observability/types.ts` (new file) + - `apps/cli/src/index.ts` (add `--langfuse` flag to run command) + - `packages/core/package.json` (add `langfuse` dependency) diff --git a/openspec/changes/add-langfuse-export/specs/observability/spec.md b/openspec/changes/add-langfuse-export/specs/observability/spec.md new file mode 100644 index 00000000..711ec368 --- /dev/null +++ b/openspec/changes/add-langfuse-export/specs/observability/spec.md @@ -0,0 +1,104 @@ +# Spec: Observability Capability + +## Purpose + +Defines trace export functionality for sending AgentV evaluation data to external observability platforms. Enables debugging, monitoring, and analysis of agent execution through industry-standard tooling. + +## ADDED Requirements + +### Requirement: Langfuse Trace Export + +The system SHALL support exporting evaluation traces to Langfuse when enabled via CLI flag. 
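+
+As a non-normative illustration only, a minimal sketch of what per-case export could look like with the v2-style `langfuse` JS SDK (`trace()`, `score()`, `flushAsync()`); the eval case, target, and result objects below are placeholders, not AgentV's real types:
+
+```typescript
+import { Langfuse } from "langfuse";
+
+// Placeholder shapes for illustration only; AgentV's real types may differ.
+const evalCase = { id: "case-001", dataset: "smoke-tests" };
+const target = { name: "openai-gpt-4o" };
+const result = { score: 0.85 };
+
+const langfuse = new Langfuse({
+  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
+  secretKey: process.env.LANGFUSE_SECRET_KEY,
+  baseUrl: process.env.LANGFUSE_HOST, // falls back to Langfuse Cloud when unset
+});
+
+// One trace per completed eval case, named by eval_id.
+const trace = langfuse.trace({
+  name: evalCase.id,
+  metadata: { target: target.name, dataset: evalCase.dataset, score: result.score },
+});
+trace.score({ name: "eval_score", value: result.score });
+
+await langfuse.flushAsync();
+```
+
+A real implementation would run only when `--langfuse` is enabled, reuse one client across cases, and flush after each case per the design doc's flush strategy.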
+ +#### Scenario: Export enabled with valid credentials + +- **WHEN** the user runs `agentv run eval.yaml --langfuse` +- **AND** `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` environment variables are set +- **THEN** the system creates a Langfuse trace for each completed eval case +- **AND** the trace includes the `eval_id` as the trace name +- **AND** the trace includes metadata for `target`, `dataset`, and `score` + +#### Scenario: Export disabled by default + +- **WHEN** the user runs `agentv run eval.yaml` without `--langfuse` flag +- **THEN** no traces are sent to Langfuse +- **AND** the evaluation proceeds normally without observability overhead + +#### Scenario: Missing credentials with flag enabled + +- **WHEN** the user runs `agentv run eval.yaml --langfuse` +- **AND** `LANGFUSE_PUBLIC_KEY` or `LANGFUSE_SECRET_KEY` is not set +- **THEN** the system emits a warning message +- **AND** evaluation proceeds without Langfuse export + +### Requirement: OutputMessage to Trace Mapping + +The system SHALL convert `output_messages` to Langfuse-compatible trace structure. + +#### Scenario: Assistant message becomes Generation + +- **WHEN** an `OutputMessage` has `role: "assistant"` and `content` +- **THEN** a Langfuse Generation is created with the content as output +- **AND** the Generation includes `gen_ai.request.model` if available from target + +#### Scenario: Tool call becomes Span + +- **WHEN** an `OutputMessage` contains `toolCalls` array +- **THEN** each `ToolCall` becomes a Langfuse Span with `type: "tool"` +- **AND** the Span includes `gen_ai.tool.name` attribute set to the tool name +- **AND** the Span includes `gen_ai.tool.call.id` if the tool call has an `id` + +#### Scenario: Evaluation score attached to trace + +- **WHEN** an `EvaluationResult` is exported +- **THEN** the trace includes a Langfuse Score with `name: "eval_score"` and `value` set to the result score +- **AND** the Score includes `comment` with the evaluation reasoning if available + +### Requirement: Privacy-Controlled Content Capture + +The system SHALL respect privacy settings when exporting trace content. + +#### Scenario: Content capture disabled (default) + +- **WHEN** `LANGFUSE_CAPTURE_CONTENT` is not set or set to `"false"` +- **THEN** message content is replaced with placeholder text `"[content hidden]"` +- **AND** tool call inputs are replaced with `{}` +- **AND** tool call outputs are replaced with `"[output hidden]"` + +#### Scenario: Content capture enabled + +- **WHEN** `LANGFUSE_CAPTURE_CONTENT` is set to `"true"` +- **THEN** full message content is included in Generations +- **AND** full tool call inputs and outputs are included in Spans + +### Requirement: Custom Langfuse Host + +The system SHALL support self-hosted Langfuse instances. + +#### Scenario: Custom host configuration + +- **WHEN** `LANGFUSE_HOST` environment variable is set +- **THEN** the exporter sends traces to the specified host URL +- **AND** authentication uses the same `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY` + +#### Scenario: Default to cloud host + +- **WHEN** `LANGFUSE_HOST` is not set +- **THEN** the exporter uses the default Langfuse cloud endpoint + +### Requirement: Graceful Export Failures + +The system SHALL handle export errors without disrupting evaluation. 
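+
+A non-normative sketch of the intended failure handling: the export call is wrapped so that any error is logged as a warning and never propagated to the evaluation loop. The `export` signature here mirrors the design doc's proposed `LangfuseExporter` API and is illustrative:
+
+```typescript
+// Sketch only: the exporter shape mirrors the proposed LangfuseExporter API.
+interface TraceExporterLike {
+  export(result: unknown, messages: unknown[]): Promise<void>;
+}
+
+async function exportSafely(
+  exporter: TraceExporterLike,
+  result: unknown,
+  messages: unknown[],
+): Promise<void> {
+  try {
+    await exporter.export(result, messages);
+  } catch (error) {
+    // Warn and continue: the result file is still written and later cases still attempt export.
+    const message = error instanceof Error ? error.message : String(error);
+    console.warn(`Langfuse export failed, continuing evaluation: ${message}`);
+  }
+}
+```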
+ +#### Scenario: Network error during export + +- **WHEN** sending a trace to Langfuse fails due to network error +- **THEN** the system logs a warning with the error details +- **AND** the evaluation result is still written to the output file +- **AND** subsequent eval cases continue to attempt export + +#### Scenario: Flush at evaluation end + +- **WHEN** all eval cases have completed +- **THEN** the system flushes any pending traces to Langfuse +- **AND** waits for flush to complete before exiting (with timeout) diff --git a/openspec/changes/add-langfuse-export/tasks.md b/openspec/changes/add-langfuse-export/tasks.md new file mode 100644 index 00000000..b97d2fba --- /dev/null +++ b/openspec/changes/add-langfuse-export/tasks.md @@ -0,0 +1,51 @@ +# Tasks: Add Langfuse Export + +## 1. Core Implementation + +- [ ] 1.1 Create `packages/core/src/observability/` directory structure +- [ ] 1.2 Define `TraceExporter` interface in `types.ts` +- [ ] 1.3 Implement `LangfuseExporter` class with trace/span conversion +- [ ] 1.4 Add `langfuse` dependency to `packages/core/package.json` +- [ ] 1.5 Export observability module from `packages/core/src/index.ts` + +## 2. OutputMessage to Langfuse Mapping + +- [ ] 2.1 Implement `convertToLangfuseTrace()` function +- [ ] 2.2 Map `OutputMessage` with content to Langfuse Generation +- [ ] 2.3 Map `ToolCall` to Langfuse Span (type: tool) +- [ ] 2.4 Attach evaluation score to trace +- [ ] 2.5 Add `gen_ai.*` semantic convention attributes + +## 3. Privacy Controls + +- [ ] 3.1 Implement content filtering based on `LANGFUSE_CAPTURE_CONTENT` +- [ ] 3.2 Strip message content when capture disabled +- [ ] 3.3 Strip tool inputs/outputs when capture disabled +- [ ] 3.4 Document privacy behavior in code comments + +## 4. CLI Integration + +- [ ] 4.1 Add `--langfuse` flag to `run` command in `apps/cli/src/index.ts` +- [ ] 4.2 Validate required environment variables when flag is set +- [ ] 4.3 Initialize `LangfuseExporter` when enabled +- [ ] 4.4 Call exporter after each `EvaluationResult` is produced +- [ ] 4.5 Flush exporter after all eval cases complete + +## 5. Error Handling + +- [ ] 5.1 Catch and log Langfuse SDK errors without failing evaluation +- [ ] 5.2 Warn on missing credentials when `--langfuse` is used +- [ ] 5.3 Handle network timeouts gracefully + +## 6. Testing + +- [ ] 6.1 Unit tests for `convertToLangfuseTrace()` mapping +- [ ] 6.2 Unit tests for content filtering logic +- [ ] 6.3 Integration test with mock Langfuse server (optional) +- [ ] 6.4 Add example in `examples/` directory + +## 7. Documentation + +- [ ] 7.1 Add CLI help text for `--langfuse` flag +- [ ] 7.2 Document environment variables in README or docs +- [ ] 7.3 Add usage example to CLI `--help` output
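+
+A rough, non-authoritative sketch of the content filtering described in tasks 3.1-3.3, assuming illustrative `OutputMessage`/`ToolCall` shapes; the placeholder strings follow the spec scenarios for disabled content capture:
+
+```typescript
+// Illustrative shapes; AgentV's real OutputMessage/ToolCall types may differ.
+interface ToolCall { id?: string; tool: string; input?: unknown; output?: unknown }
+interface OutputMessage { role: string; content?: string; toolCalls?: ToolCall[] }
+
+// Placeholders follow the spec scenarios when LANGFUSE_CAPTURE_CONTENT is not "true".
+function redactMessage(message: OutputMessage, captureContent: boolean): OutputMessage {
+  if (captureContent) {
+    return message;
+  }
+  return {
+    ...message,
+    content: message.content === undefined ? undefined : "[content hidden]",
+    toolCalls: message.toolCalls?.map((call) => ({
+      ...call,
+      input: {},
+      output: call.output === undefined ? undefined : "[output hidden]",
+    })),
+  };
+}
+```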