diff --git a/examples/demos/procurement_agent/docs/plans/2026-01-19-eval-framework-design.md b/examples/demos/procurement_agent/docs/plans/2026-01-19-eval-framework-design.md
@@ -0,0 +1,179 @@
+# Procurement Agent Eval Framework Design
+
+## Overview
+
+The eval framework is a suite of integration tests that run against a live Temporal workflow to verify that the procurement agent behaves correctly. Each test sends real events via signals, captures the tool-call transcript, and verifies the resulting database state changes.
+
+## Architecture
+
+```
+evals/
+├── conftest.py              # Pytest fixtures: workflow setup, DB helpers, temporal client
+├── graders/
+│   ├── tool_calls.py        # Verify required/forbidden tool calls in transcript
+│   └── database.py          # Verify DB state changes
+├── tasks/
+│   ├── test_submittal_approved.py
+│   ├── test_shipment_departed.py      # Multiple cases for false positive issue
+│   ├── test_shipment_arrived.py
+│   ├── test_inspection_failed.py      # Human-in-the-loop scenarios
+│   └── test_inspection_passed.py
+└── fixtures/
+    └── events.py            # Pre-built event payloads for each scenario
+```
+
+## Test Flow (per task)
+
+1. Spin up fresh workflow via Temporal client
+2. Send event signal(s) to workflow
+3. Wait for agent to process (poll for completion or timeout)
+4. **Capture transcript** - extract tool calls from workflow history
+5. **Query DB** - check procurement_items and schedule tables
+6. **Grade** - assert required tool calls present, forbidden calls absent, DB state correct
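Step 3 ("poll for completion or timeout") could be served by a small generic helper. The following is a minimal sketch, not the suite's actual implementation: the name `wait_until` and its default timeout and interval are assumptions chosen for illustration, and the predicate would wrap whatever workflow query the real tests use.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True if the predicate became true in time, False on timeout.
    The predicate is always checked at least once.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Returning a boolean rather than raising lets each test decide whether a timeout is itself a grading failure or just a reason to capture the partial transcript.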
+
+## Isolation
+
+- Each test gets a unique `workflow_id`
+- DB queries scoped to that workflow_id
+- No shared state between tests
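One way to get a unique `workflow_id` per test is to combine the test ID with a random suffix. This is a sketch under assumptions: the `eval-` prefix, the ID format, and the function name `make_workflow_id` are hypothetical and not taken from the codebase.

```python
import uuid


def make_workflow_id(test_id: str) -> str:
    """Build a workflow_id that is unique per test run.

    The `eval-` prefix and the overall format are illustrative assumptions.
    Keeping the test ID in the name makes workflow histories easier to
    trace back to the test that produced them.
    """
    return f"eval-{test_id}-{uuid.uuid4().hex[:12]}"
```

A fixture in `conftest.py` could call this once per test and pass the result both to the workflow start call and to the DB graders, so that every query is scoped to the same ID.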
+
+---
+
+## Test Cases
+
+### Submittal_Approved
+
+**What should happen:**
+1. Agent wakes up on event
+2. Issues purchase order (tool call)
+3. Creates procurement item in DB with status + PO ID
+
+| Test ID | Item | Expected Tool Calls | Expected DB State |
+|---------|------|---------------------|-------------------|
+| `submittal_01` | Steel Beams | `issue_purchase_order`, `create_procurement_item_tool` | `status="purchase_order_issued"`, `purchase_order_id` not null |
+| `submittal_02` | HVAC Units | Same | Same |
+
+**Grading:**
+- Required tools: `issue_purchase_order`, `create_procurement_item_tool`
+- DB: procurement item exists with correct status and PO ID
+
+---
+
+### Shipment_Departed_Factory (Critical - False Positive Issue)
+
+**What should happen:**
+1. Agent wakes up, ingests ETA
+2. Cross-references with master schedule
+3. **Only flag if ETA ≥ required_by** (zero/negative buffer)
+4. Update procurement item with ETA and status
+
+**Conflict Logic:**
+- Flag if ETA >= required_by (arriving on or after deadline)
+- Don't flag if ETA < required_by (arriving before deadline is OK)
+
+| Test ID | Item | ETA | Required By | Should Flag? | Rationale |
+|---------|------|-----|-------------|--------------|-----------|
+| `departed_01_no_flag` | Steel Beams | 2026-02-10 | 2026-02-15 | **NO** | 5 days early - well within buffer |
+| `departed_02_no_flag` | Steel Beams | 2026-02-14 | 2026-02-15 | **NO** | 1 day early - still OK |
+| `departed_03_flag` | Steel Beams | 2026-02-15 | 2026-02-15 | **YES** | Arrives ON deadline - zero buffer |
+| `departed_04_flag` | Steel Beams | 2026-02-20 | 2026-02-15 | **YES** | 5 days LATE - definite conflict |
+| `departed_05_no_flag` | Windows | 2026-03-05 | 2026-03-15 | **NO** | 10 days early - uses buffer but OK |
+| `departed_06_no_flag` | HVAC Units | 2026-02-28 | 2026-03-01 | **NO** | 1 day early - boundary case, still OK |
+
+**Grading:**
+- Always required: `update_procurement_item_tool`
+- If ETA >= required_by: also require `flag_potential_issue`
+- If ETA < required_by: `flag_potential_issue` is **forbidden** (catches false positives)
+- DB: procurement item has correct ETA and status
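The conflict rule and the resulting required/forbidden tool sets can be written down directly from the table above. This is a sketch: the function names `should_flag` and `expected_tools` and the `CASES` mapping are illustrative, while the dates and tool names come from this plan.

```python
from datetime import date

ALWAYS_REQUIRED = {"update_procurement_item_tool"}
FLAG_TOOL = "flag_potential_issue"


def should_flag(eta: date, required_by: date) -> bool:
    """Flag when the shipment arrives on or after the deadline."""
    return eta >= required_by


def expected_tools(eta: date, required_by: date) -> tuple[set[str], set[str]]:
    """Return (required, forbidden) tool names for a departed-shipment case."""
    if should_flag(eta, required_by):
        return ALWAYS_REQUIRED | {FLAG_TOOL}, set()
    return set(ALWAYS_REQUIRED), {FLAG_TOOL}


# (ETA, required_by) for each test ID in the table above.
CASES = {
    "departed_01_no_flag": (date(2026, 2, 10), date(2026, 2, 15)),
    "departed_02_no_flag": (date(2026, 2, 14), date(2026, 2, 15)),
    "departed_03_flag": (date(2026, 2, 15), date(2026, 2, 15)),
    "departed_04_flag": (date(2026, 2, 20), date(2026, 2, 15)),
    "departed_05_no_flag": (date(2026, 3, 5), date(2026, 3, 15)),
    "departed_06_no_flag": (date(2026, 2, 28), date(2026, 3, 1)),
}
```

Deriving the forbidden set from the same predicate as the required set keeps the two from drifting apart if the threshold is ever changed.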
+
+---
+
+### Shipment_Arrived_Site
+
+**What should happen:**
+1. Agent wakes up on arrival event
+2. Notifies receiving team
+3. Schedules inspection
+4. Updates procurement item with arrival date and status
+
+| Test ID | Item | Prerequisite State | Expected Tool Calls | Expected DB State |
+|---------|------|-------------------|---------------------|-------------------|
+| `arrived_01` | Steel Beams | `shipment_departed` | `notify_team_shipment_arrived`, `schedule_inspection`, `update_procurement_item_tool` | `status="shipment_arrived"`, `date_arrived` set |
+| `arrived_02` | Windows | `shipment_departed` | Same | Same |
+
+**Grading:**
+- Required tools: `notify_team_shipment_arrived`, `schedule_inspection`, `update_procurement_item_tool`
+- DB: procurement item has status and arrival date set
+
+---
+
+### Inspection_Failed (Human-in-the-Loop)
+
+**What should happen:**
+1. Agent wakes up, analyzes failure
+2. Escalates to human with recommended action (`wait_for_human` tool)
+3. Pauses workflow until human responds
+4. On human response, executes appropriate actions
+
+**Simulating Human Input:**
+Send human responses via the `RECEIVE_EVENT` signal (same path as UI), which populates `human_queue`.
+
+| Test ID | Scenario | Human Response | Expected Tool Calls | Expected DB State |
+|---------|----------|----------------|---------------------|-------------------|
+| `failed_01_approve` | Human approves recommendation | `"Yes"` | `wait_for_human`, then agent's recommended tools | Status updated per recommendation |
+| `failed_02_approve_plus` | Human approves + extra action | `"Yes, and also update the delivery date to 2026-03-15"` | `wait_for_human`, `update_delivery_date_tool`, `update_procurement_item_tool` | Status updated + schedule delivery date changed |
+| `failed_03_reject_delete` | Human rejects, wants deletion | `"No, remove it from the master schedule entirely"` | `wait_for_human`, `remove_delivery_item_tool`, `delete_procurement_item_tool` | Item removed from schedule, procurement item deleted |
+
+**Test Flow:**
+1. Setup: ensure item exists in DB (from prior events)
+2. Send inspection failed event
+3. Wait for agent to call `wait_for_human` (poll workflow state)
+4. Send human response via signal
+5. Wait for agent to complete processing
+6. Grade: check transcript + DB
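Steps 2 through 6 of this flow can be sketched against a thin handle interface. Everything here is hypothetical: `WorkflowHandle`, its method names, and the payload shapes are assumptions, not the Temporal client API, and `ScriptedWorkflow` is an in-memory fake that stands in for the live agent so the sketch runs on its own. In the real suite both events would travel through the `RECEIVE_EVENT` signal, and step 1 (seeding the item in the DB) would happen before this function is called.

```python
import time
from typing import Callable, Protocol


class WorkflowHandle(Protocol):
    """Minimal surface the test needs (hypothetical wrapper, not a Temporal API)."""

    def send_event(self, payload: dict) -> None: ...
    def is_waiting_for_human(self) -> bool: ...
    def is_done(self) -> bool: ...
    def tool_calls(self) -> list[str]: ...


def _poll(check: Callable[[], bool], timeout_s: float, interval_s: float = 0.01) -> None:
    """Block until `check()` is true, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while not check():
        if time.monotonic() >= deadline:
            raise TimeoutError("workflow did not reach the expected state")
        time.sleep(interval_s)


def run_inspection_failed(
    handle: WorkflowHandle, item: str, human_response: str, timeout_s: float = 30.0
) -> list[str]:
    """Drive one human-in-the-loop case and return the tool-call transcript."""
    handle.send_event({"event_type": "Inspection_Failed", "item": item})  # step 2
    _poll(handle.is_waiting_for_human, timeout_s)                         # step 3
    handle.send_event({"event_type": "Human_Response", "text": human_response})  # step 4
    _poll(handle.is_done, timeout_s)                                      # step 5
    return handle.tool_calls()                                            # input to step 6


class ScriptedWorkflow:
    """In-memory fake: replays a fixed tool-call script once the human replies."""

    def __init__(self, tools_after_response: list[str]) -> None:
        self._script = tools_after_response
        self._calls: list[str] = []
        self._waiting = False
        self._done = False

    def send_event(self, payload: dict) -> None:
        if payload["event_type"] == "Inspection_Failed":
            self._calls.append("wait_for_human")
            self._waiting = True
        elif self._waiting:
            self._calls.extend(self._script)
            self._waiting = False
            self._done = True

    def is_waiting_for_human(self) -> bool:
        return self._waiting

    def is_done(self) -> bool:
        return self._done

    def tool_calls(self) -> list[str]:
        return list(self._calls)
```

Waiting explicitly for the `wait_for_human` state before sending the reply keeps the test from racing the agent: a response sent too early could be consumed before the agent has finished analyzing the failure.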
+
+---
+
+### Inspection_Passed
+
+**What should happen:**
+1. Agent wakes up on event
+2. Updates procurement item status to `inspection_passed`
+3. No escalation needed
+
+| Test ID | Item | Prerequisite State | Expected Tool Calls | Expected DB State |
+|---------|------|-------------------|---------------------|-------------------|
+| `passed_01` | Steel Beams | `shipment_arrived` | `update_procurement_item_tool` | `status="inspection_passed"` |
+| `passed_02` | Windows | `shipment_arrived` | Same | Same |
+
+**Grading:**
+- Required tools: `update_procurement_item_tool`
+- Forbidden tools: `wait_for_human`, `flag_potential_issue` (should NOT escalate on success)
+- DB: procurement item has correct status
+
+---
+
+## Summary
+
+| Event Type | # Tests | Key Focus |
+|------------|---------|-----------|
+| Submittal_Approved | 2 | PO issued, DB entry created |
+| Shipment_Departed_Factory | 6 | **False positive detection** - forbidden tool calls |
+| Shipment_Arrived_Site | 2 | Team notified, inspection scheduled |
+| Inspection_Failed | 3 | Human-in-the-loop scenarios |
+| Inspection_Passed | 2 | Simple DB update, no escalation |
+| **Total** | **15** | |
+
+## Grader Approach
+
+- **Code-based only** - no LLM judges needed
+- Required tool calls (must appear in transcript)
+- Forbidden tool calls (must NOT appear) - catches false positives
+- DB state assertions scoped to workflow_id
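A code-based tool-call grader along these lines needs only set operations. This is a minimal sketch of what `graders/tool_calls.py` might contain; the names `ToolCallResult` and `grade_tool_calls` are assumptions, and the transcript is modeled as a plain list of tool names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ToolCallResult:
    missing: frozenset[str]     # required tools that never appeared
    unexpected: frozenset[str]  # forbidden tools that did appear

    @property
    def passed(self) -> bool:
        return not self.missing and not self.unexpected


def grade_tool_calls(
    transcript: Iterable[str],
    required: Iterable[str] = (),
    forbidden: Iterable[str] = (),
) -> ToolCallResult:
    """Check a transcript of tool names against required and forbidden sets.

    Order and repetition are ignored on purpose: only presence matters.
    """
    called = set(transcript)
    return ToolCallResult(
        missing=frozenset(required) - called,
        unexpected=frozenset(forbidden) & called,
    )
```

For an `Inspection_Passed` case, the call would pass `required={"update_procurement_item_tool"}` and `forbidden={"wait_for_human", "flag_potential_issue"}`; a non-empty `unexpected` set then points directly at the false positive.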
+
+## Key Design Decisions
+
+1. **Grade outcomes, not paths** - Don't require a specific tool call order; just verify that the required tools were called and the DB state is correct
+2. **Forbidden tools catch regressions** - Explicitly assert certain tools should NOT be called in specific scenarios
+3. **Isolated tests** - Each test gets unique workflow_id, no shared state
+4. **Human simulation** - Use same signal path as UI to simulate human responses