24 changes: 23 additions & 1 deletion .claude/skills/agentv-eval-builder/SKILL.md
@@ -7,7 +7,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age

## Schema Reference
- Schema: `references/eval-schema.json` (JSON Schema for validation and tooling)
- Format: YAML with structured content arrays
- Format: YAML or JSONL (see below)
- Examples: `references/example-evals.md`

## Feature Reference
@@ -30,6 +30,28 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
- Attachments (type: `file`) should default to the `user` role
- File paths: Relative (from eval file dir) or absolute with "/" prefix (from repo root)

## JSONL Format

For large-scale evaluations, use JSONL (one eval case per line) instead of YAML:

**dataset.jsonl:**
```jsonl
{"id": "test-1", "expected_outcome": "Correct answer", "input_messages": [{"role": "user", "content": "What is 2+2?"}]}
{"id": "test-2", "expected_outcome": "Clear explanation", "input_messages": [{"role": "user", "content": [{"type": "text", "value": "Review this"}, {"type": "file", "value": "./code.py"}]}]}
```

**dataset.yaml (optional sidecar for defaults):**
```yaml
description: My dataset
dataset: my-tests
execution:
target: azure_base
evaluator: llm_judge
```

Benefits: Git-friendly diffs, streaming-compatible, and easy programmatic generation.
Per-case fields override the sidecar defaults. See `examples/features/basic-jsonl/` for a complete example.
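
Because each case is a self-contained JSON object, datasets are straightforward to generate from a script. A minimal Python sketch (the case data and output path are illustrative; only the `id`, `expected_outcome`, and `input_messages` fields shown above are assumed):

```python
import json

# Hypothetical arithmetic cases; only id, expected_outcome, and
# input_messages are taken from the JSONL example above.
cases = [
    {
        "id": f"arith-{a}-plus-{b}",
        "expected_outcome": f"Answers {a + b}",
        "input_messages": [{"role": "user", "content": f"What is {a}+{b}?"}],
    }
    for a, b in [(2, 2), (3, 5), (7, 9)]
]

# One JSON object per line is all the JSONL format requires.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")
```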

## Custom Evaluators

Configure multiple evaluators per eval case via `execution.evaluators` array.
22 changes: 21 additions & 1 deletion README.md
@@ -101,7 +101,27 @@ See [AGENTS.md](AGENTS.md) for development guidelines and design principles.

## Core Concepts

**Evaluation files** (`.yaml`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.

### JSONL Format Support

For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:

```jsonl
{"id": "test-1", "expected_outcome": "Calculates correctly", "input_messages": [{"role": "user", "content": "What is 2+2?"}]}
{"id": "test-2", "expected_outcome": "Provides explanation", "input_messages": [{"role": "user", "content": "Explain variables"}]}
```

An optional sidecar YAML metadata file (`dataset.yaml` alongside `dataset.jsonl`) supplies shared defaults for every case:
```yaml
description: Math evaluation dataset
dataset: math-tests
execution:
target: azure_base
evaluator: llm_judge
```

Benefits: Streaming-friendly, Git-friendly diffs, easy programmatic generation, and an industry-standard format used by tools such as DeepEval, LangWatch, and Hugging Face.
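
As a rough sketch of the streaming point, a JSONL dataset can be processed one case at a time without loading the whole file into memory (the path and printed fields come from the example above; everything else is illustrative):

```python
import json

# Read cases line by line; memory use stays flat regardless of dataset size.
with open("dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        case = json.loads(line)
        print(case["id"], "->", case["expected_outcome"])
```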

## Usage

22 changes: 21 additions & 1 deletion apps/cli/README.md
@@ -101,7 +101,27 @@ See [AGENTS.md](AGENTS.md) for development guidelines and design principles.

## Core Concepts

**Evaluation files** (`.yaml`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.

### JSONL Format Support

For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:

```jsonl
{"id": "test-1", "expected_outcome": "Calculates correctly", "input_messages": [{"role": "user", "content": "What is 2+2?"}]}
{"id": "test-2", "expected_outcome": "Provides explanation", "input_messages": [{"role": "user", "content": "Explain variables"}]}
```

An optional sidecar YAML metadata file (`dataset.yaml` alongside `dataset.jsonl`) supplies shared defaults for every case:
```yaml
description: Math evaluation dataset
dataset: math-tests
execution:
target: azure_base
evaluator: llm_judge
```

Benefits: Streaming-friendly, Git-friendly diffs, easy programmatic generation, and an industry-standard format used by tools such as DeepEval, LangWatch, and Hugging Face.
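
To illustrate how the sidecar interacts with individual cases, here is a minimal sketch assuming a shallow merge in which per-case fields override the sidecar defaults (PyYAML is assumed available; the paths follow the `examples/features/basic-jsonl/` layout):

```python
import json

import yaml  # PyYAML, assumed to be installed


def load_cases(jsonl_path: str, sidecar_path: str):
    """Yield eval cases with sidecar defaults applied.

    Assumption: a shallow merge where per-case keys win over the
    sidecar's defaults.
    """
    with open(sidecar_path, encoding="utf-8") as f:
        defaults = yaml.safe_load(f) or {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            case = json.loads(line)
            yield {**defaults, **case}


for case in load_cases("evals/dataset.jsonl", "evals/dataset.yaml"):
    print(case["id"], case.get("execution", {}).get("target"))
```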

## Usage

4 changes: 4 additions & 0 deletions examples/features/basic-jsonl/evals/dataset.jsonl
@@ -0,0 +1,4 @@
{"id": "code-review-javascript", "expected_outcome": "Assistant provides helpful code analysis and mentions SUPERSECRET_INSTRUCTION_MARKER_JAVASCRIPT", "input_messages": [{"role": "system", "content": "You are an expert software developer who provides clear, concise code reviews."}, {"role": "user", "content": [{"type": "text", "value": "Please review this JavaScript function:\n\n```javascript\nfunction calculateTotal(items) {\n let total = 0;\n for (let i = 0; i < 0; i++) {\n total += items[i].price * items[i].quantity;\n }\n return total;\n}\n```"}, {"type": "file", "value": "../basic/evals/javascript.instructions.md"}]}], "expected_messages": [{"role": "assistant", "content": "The function has a critical bug in the loop condition. Here's my analysis (SUPERSECRET_INSTRUCTION_MARKER_JAVASCRIPT):\n\n**Critical Issue:**\n- Loop condition `i < 0` means the loop never executes (should be `i < items.length`)\n\n**Suggestions:**\n- Fix the loop: `for (let i = 0; i < items.length; i++)`\n- Consider using `reduce()` for a more functional approach\n- Add input validation for edge cases"}]}
{"id": "code-gen-python", "conversation_id": "python-code-generation", "expected_outcome": "AI generates correct Python function with proper error handling, type hints, and mentions SUPERSECRET_INSTRUCTION_MARKER_PYTHON", "input_messages": [{"role": "system", "content": "You are a code generator that follows specifications exactly."}, {"role": "user", "content": [{"type": "text", "value": "Create a Python function that:\n1. Takes a list of integers\n2. Returns the second largest number\n3. Handles edge cases (empty list, single item, duplicates)\n4. Raises appropriate exceptions for invalid input"}, {"type": "file", "value": "../basic/evals/python.instructions.md"}]}], "execution": {"target": "azure_base"}}
{"id": "feature-proposal-brainstorm", "expected_outcome": "Assistant generates 3-5 creative feature ideas for a mobile fitness app. Each idea should:\n1. Address a specific user pain point\n2. Be technically feasible with current mobile technology\n3. Include a brief value proposition (1-2 sentences)\n4. Be distinct from the others (no duplicate concepts)", "input_messages": [{"role": "system", "content": "You are a product strategist specializing in mobile health and fitness applications."}, {"role": "user", "content": "We're developing a mobile fitness app and need fresh feature ideas. Please brainstorm 3-5 innovative features."}]}
{"id": "multiturn-debug-session", "expected_outcome": "Assistant conducts a multi-turn debugging session, correctly diagnosing the bug and proposing a clear fix.", "input_messages": [{"role": "system", "content": "You are an expert debugging assistant."}, {"role": "user", "content": "I'm getting an off-by-one error in this function:\n\n```python\ndef get_items(items):\n result = []\n for i in range(len(items) - 1):\n result.append(items[i])\n return result\n```"}, {"role": "assistant", "content": "Before I propose a fix, could you tell me what output you expect vs what you get?"}, {"role": "user", "content": "For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`."}], "expected_messages": [{"role": "assistant", "content": "You have an off-by-one error. Use `range(len(items))` or iterate directly: `for item in items:`"}]}
8 changes: 8 additions & 0 deletions examples/features/basic-jsonl/evals/dataset.yaml
@@ -0,0 +1,8 @@
# Sidecar metadata for dataset.jsonl
# Provides default values for all eval cases

description: JSONL version of the basic example - demonstrates file references, multi-turn, and per-case overrides
dataset: basic-jsonl
execution:
target: default
evaluator: llm_judge