24 changes: 23 additions & 1 deletion .claude/skills/agentv-eval-builder/SKILL.md
@@ -7,7 +7,7 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age

## Schema Reference
- Schema: `references/eval-schema.json` (JSON Schema for validation and tooling)
- Format: YAML with structured content arrays
- Format: YAML or JSONL (see below)
- Examples: `references/example-evals.md`

## Feature Reference
@@ -30,6 +30,28 @@ description: Create and maintain AgentV YAML evaluation files for testing AI age
- Attachments (type: `file`) should default to the `user` role
- File paths: Relative (from eval file dir) or absolute with "/" prefix (from repo root)

## JSONL Format

For large-scale evaluations, use JSONL (one eval case per line) instead of YAML:

**dataset.jsonl:**
```jsonl
{"id": "test-1", "expected_outcome": "Correct answer", "input_messages": [{"role": "user", "content": "What is 2+2?"}]}
{"id": "test-2", "expected_outcome": "Clear explanation", "input_messages": [{"role": "user", "content": [{"type": "text", "value": "Review this"}, {"type": "file", "value": "./code.py"}]}]}
```

**dataset.yaml (optional sidecar for defaults):**
```yaml
description: My dataset
dataset: my-tests
execution:
target: azure_base
evaluator: llm_judge
```

Benefits: Git-friendly diffs, streaming-compatible, and easy programmatic generation.
Per-case fields override the sidecar defaults. See `examples/features/basic-jsonl/` for a complete example.
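
Because each case is a self-contained JSON object, datasets are straightforward to generate from a script. A minimal Python sketch (the case data and output path are illustrative; only the `id`, `expected_outcome`, and `input_messages` fields shown above are assumed):

```python
import json

# Hypothetical arithmetic cases; only id, expected_outcome, and
# input_messages are taken from the JSONL example above.
cases = [
    {
        "id": f"arith-{a}-plus-{b}",
        "expected_outcome": f"Answers {a + b}",
        "input_messages": [{"role": "user", "content": f"What is {a}+{b}?"}],
    }
    for a, b in [(2, 2), (3, 5), (7, 9)]
]

# One JSON object per line is all the JSONL format requires.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")
```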

## Custom Evaluators

Configure multiple evaluators per eval case via `execution.evaluators` array.
22 changes: 21 additions & 1 deletion README.md
@@ -101,7 +101,27 @@ See [AGENTS.md](AGENTS.md) for development guidelines and design principles.

## Core Concepts

**Evaluation files** (`.yaml`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.

### JSONL Format Support

For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:

```jsonl
{"id": "test-1", "expected_outcome": "Calculates correctly", "input_messages": [{"role": "user", "content": "What is 2+2?"}]}
{"id": "test-2", "expected_outcome": "Provides explanation", "input_messages": [{"role": "user", "content": "Explain variables"}]}
```

An optional sidecar YAML metadata file (`dataset.yaml` alongside `dataset.jsonl`) supplies shared defaults for every case:
```yaml
description: Math evaluation dataset
dataset: math-tests
execution:
target: azure_base
evaluator: llm_judge
```

Benefits: Streaming-friendly, Git-friendly diffs, easy programmatic generation, and an industry-standard format used by tools such as DeepEval, LangWatch, and Hugging Face.
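
As a rough sketch of the streaming point, a JSONL dataset can be processed one case at a time without loading the whole file into memory (the path and printed fields come from the example above; everything else is illustrative):

```python
import json

# Read cases line by line; memory use stays flat regardless of dataset size.
with open("dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        case = json.loads(line)
        print(case["id"], "->", case["expected_outcome"])
```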

## Usage

22 changes: 21 additions & 1 deletion apps/cli/README.md
@@ -101,7 +101,27 @@ See [AGENTS.md](AGENTS.md) for development guidelines and design principles.

## Core Concepts

**Evaluation files** (`.yaml`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.
**Evaluation files** (`.yaml` or `.jsonl`) define test cases with expected outcomes. **Targets** specify which agent/provider to evaluate. **Judges** (code or LLM) score results. **Results** are written as JSONL/YAML for analysis and comparison.

### JSONL Format Support

For large-scale evaluations, AgentV supports JSONL (JSON Lines) format as an alternative to YAML:

```jsonl
{"id": "test-1", "expected_outcome": "Calculates correctly", "input_messages": [{"role": "user", "content": "What is 2+2?"}]}
{"id": "test-2", "expected_outcome": "Provides explanation", "input_messages": [{"role": "user", "content": "Explain variables"}]}
```

An optional sidecar YAML metadata file (`dataset.yaml` alongside `dataset.jsonl`) supplies shared defaults for every case:
```yaml
description: Math evaluation dataset
dataset: math-tests
execution:
target: azure_base
evaluator: llm_judge
```

Benefits: Streaming-friendly, Git-friendly diffs, easy programmatic generation, and an industry-standard format used by tools such as DeepEval, LangWatch, and Hugging Face.
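
To illustrate how the sidecar interacts with individual cases, here is a minimal sketch assuming a shallow merge in which per-case fields override the sidecar defaults (PyYAML is assumed available; the paths follow the `examples/features/basic-jsonl/` layout):

```python
import json

import yaml  # PyYAML, assumed to be installed


def load_cases(jsonl_path: str, sidecar_path: str):
    """Yield eval cases with sidecar defaults applied.

    Assumption: a shallow merge where per-case keys win over the
    sidecar's defaults.
    """
    with open(sidecar_path, encoding="utf-8") as f:
        defaults = yaml.safe_load(f) or {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            case = json.loads(line)
            yield {**defaults, **case}


for case in load_cases("evals/dataset.jsonl", "evals/dataset.yaml"):
    print(case["id"], case.get("execution", {}).get("target"))
```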

## Usage

4 changes: 4 additions & 0 deletions examples/features/basic-jsonl/evals/dataset.jsonl
@@ -0,0 +1,4 @@
{"id": "code-review-javascript", "expected_outcome": "Assistant provides helpful code analysis and mentions SUPERSECRET_INSTRUCTION_MARKER_JAVASCRIPT", "input_messages": [{"role": "system", "content": "You are an expert software developer who provides clear, concise code reviews."}, {"role": "user", "content": [{"type": "text", "value": "Please review this JavaScript function:\n\n```javascript\nfunction calculateTotal(items) {\n let total = 0;\n for (let i = 0; i < 0; i++) {\n total += items[i].price * items[i].quantity;\n }\n return total;\n}\n```"}, {"type": "file", "value": "../basic/evals/javascript.instructions.md"}]}], "expected_messages": [{"role": "assistant", "content": "The function has a critical bug in the loop condition. Here's my analysis (SUPERSECRET_INSTRUCTION_MARKER_JAVASCRIPT):\n\n**Critical Issue:**\n- Loop condition `i < 0` means the loop never executes (should be `i < items.length`)\n\n**Suggestions:**\n- Fix the loop: `for (let i = 0; i < items.length; i++)`\n- Consider using `reduce()` for a more functional approach\n- Add input validation for edge cases"}]}
{"id": "code-gen-python", "conversation_id": "python-code-generation", "expected_outcome": "AI generates correct Python function with proper error handling, type hints, and mentions SUPERSECRET_INSTRUCTION_MARKER_PYTHON", "input_messages": [{"role": "system", "content": "You are a code generator that follows specifications exactly."}, {"role": "user", "content": [{"type": "text", "value": "Create a Python function that:\n1. Takes a list of integers\n2. Returns the second largest number\n3. Handles edge cases (empty list, single item, duplicates)\n4. Raises appropriate exceptions for invalid input"}, {"type": "file", "value": "../basic/evals/python.instructions.md"}]}], "execution": {"target": "azure_base"}}
{"id": "feature-proposal-brainstorm", "expected_outcome": "Assistant generates 3-5 creative feature ideas for a mobile fitness app. Each idea should:\n1. Address a specific user pain point\n2. Be technically feasible with current mobile technology\n3. Include a brief value proposition (1-2 sentences)\n4. Be distinct from the others (no duplicate concepts)", "input_messages": [{"role": "system", "content": "You are a product strategist specializing in mobile health and fitness applications."}, {"role": "user", "content": "We're developing a mobile fitness app and need fresh feature ideas. Please brainstorm 3-5 innovative features."}]}
{"id": "multiturn-debug-session", "expected_outcome": "Assistant conducts a multi-turn debugging session, correctly diagnosing the bug and proposing a clear fix.", "input_messages": [{"role": "system", "content": "You are an expert debugging assistant."}, {"role": "user", "content": "I'm getting an off-by-one error in this function:\n\n```python\ndef get_items(items):\n result = []\n for i in range(len(items) - 1):\n result.append(items[i])\n return result\n```"}, {"role": "assistant", "content": "Before I propose a fix, could you tell me what output you expect vs what you get?"}, {"role": "user", "content": "For `[1, 2, 3, 4]` I expect `[1, 2, 3, 4]`, but I get `[1, 2, 3]`."}], "expected_messages": [{"role": "assistant", "content": "You have an off-by-one error. Use `range(len(items))` or iterate directly: `for item in items:`"}]}
8 changes: 8 additions & 0 deletions examples/features/basic-jsonl/evals/dataset.yaml
@@ -0,0 +1,8 @@
# Sidecar metadata for dataset.jsonl
# Provides default values for all eval cases

description: JSONL version of the basic example - demonstrates file references, multi-turn, and per-case overrides
dataset: basic-jsonl
execution:
target: default
evaluator: llm_judge