| 1 | +# Procurement Agent Eval Framework Design |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This framework is a suite of integration tests that run against a live Temporal workflow to verify that the procurement agent behaves correctly. Each test sends real events via signals, captures the tool-call transcript, and verifies the resulting database state changes. |
| 6 | + |
| 7 | +## Architecture |
| 8 | + |
| 9 | +``` |
| 10 | +evals/ |
| 11 | +├── conftest.py # Pytest fixtures: workflow setup, DB helpers, temporal client |
| 12 | +├── graders/ |
| 13 | +│ ├── tool_calls.py # Verify required/forbidden tool calls in transcript |
| 14 | +│ └── database.py # Verify DB state changes |
| 15 | +├── tasks/ |
| 16 | +│ ├── test_submittal_approved.py |
| 17 | +│ ├── test_shipment_departed.py # Multiple cases for false positive issue |
| 18 | +│ ├── test_shipment_arrived.py |
| 19 | +│ ├── test_inspection_failed.py # Human-in-the-loop scenarios |
| 20 | +│ └── test_inspection_passed.py |
| 21 | +└── fixtures/ |
| 22 | + └── events.py # Pre-built event payloads for each scenario |
| 23 | +``` |
| 24 | + |
| 25 | +## Test Flow (per task) |
| 26 | + |
| 27 | +1. Spin up fresh workflow via Temporal client |
| 28 | +2. Send event signal(s) to workflow |
| 29 | +3. Wait for agent to process (poll for completion or timeout) |
| 30 | +4. **Capture transcript** - extract tool calls from workflow history |
| 31 | +5. **Query DB** - check procurement_items and schedule tables |
| 32 | +6. **Grade** - assert required tool calls present, forbidden calls absent, DB state correct |
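Step 3 (poll for completion or timeout) could be handled by a small helper like the one below. This is a minimal sketch, not the framework's actual implementation: `wait_until` and its parameters are hypothetical names, and `condition` stands in for whatever completion check the tests use.

```python
import time
from typing import Callable


def wait_until(
    condition: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Returning a boolean rather than raising lets each test decide whether a timeout is a grading failure or an infrastructure error.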
| 33 | + |
| 34 | +## Isolation |
| 35 | + |
| 36 | +- Each test gets a unique `workflow_id` |
| 37 | +- DB queries are scoped to that `workflow_id` |
| 38 | +- No shared state between tests |
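One way to get a unique `workflow_id` per test is to append a random suffix to the test ID. The sketch below is illustrative: the `eval-` prefix and the `make_workflow_id` helper are assumptions, not names from the design.

```python
import uuid


def make_workflow_id(test_id: str) -> str:
    """Return a workflow ID that is unique per test run (hypothetical format)."""
    return f"eval-{test_id}-{uuid.uuid4().hex}"
```

Keeping the test ID in the workflow ID makes it easier to trace a failing run back to its test case.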
| 39 | + |
| 40 | +--- |
| 41 | + |
| 42 | +## Test Cases |
| 43 | + |
| 44 | +### Submittal_Approved |
| 45 | + |
| 46 | +**What should happen:** |
| 47 | +1. Agent wakes up on event |
| 48 | +2. Issues purchase order (tool call) |
| 49 | +3. Creates procurement item in DB with status + PO ID |
| 50 | + |
| 51 | +| Test ID | Item | Expected Tool Calls | Expected DB State | |
| 52 | +|---------|------|---------------------|-------------------| |
| 53 | +| `submittal_01` | Steel Beams | `issue_purchase_order`, `create_procurement_item_tool` | `status="purchase_order_issued"`, `purchase_order_id` not null | |
| 54 | +| `submittal_02` | HVAC Units | Same | Same | |
| 55 | + |
| 56 | +**Grading:** |
| 57 | +- Required tools: `issue_purchase_order`, `create_procurement_item_tool` |
| 58 | +- DB: procurement item exists with correct status and PO ID |
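The DB check above could be expressed as a comparison between a fetched row and a dictionary of expected column values. This is a sketch under assumptions: the row is assumed to arrive as a mapping of column name to value, and `grade_db_row`, `NOT_NULL`, and `SUBMITTAL_EXPECTED` are hypothetical names.

```python
from typing import Any, Mapping, Optional

# Sentinel meaning "the column must be set, any non-null value is accepted".
NOT_NULL = object()

# Expected state for the Submittal_Approved cases, taken from the table above.
SUBMITTAL_EXPECTED = {
    "status": "purchase_order_issued",
    "purchase_order_id": NOT_NULL,
}


def grade_db_row(
    row: Optional[Mapping[str, Any]],
    expected: Mapping[str, Any],
) -> list[str]:
    """Compare one DB row against expected values; return failure messages."""
    if row is None:
        return ["procurement item not found"]
    failures = []
    for column, want in expected.items():
        got = row.get(column)
        if want is NOT_NULL:
            if got is None:
                failures.append(f"{column} is null")
        elif got != want:
            failures.append(f"{column}: expected {want!r}, got {got!r}")
    return failures
```

An empty list means the row passes, so a test can simply assert `grade_db_row(row, SUBMITTAL_EXPECTED) == []` and get readable messages on failure.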
| 59 | + |
| 60 | +--- |
| 61 | + |
| 62 | +### Shipment_Departed_Factory (Critical - False Positive Issue) |
| 63 | + |
| 64 | +**What should happen:** |
| 65 | +1. Agent wakes up, ingests ETA |
| 66 | +2. Cross-references with master schedule |
| 67 | +3. **Only flag if ETA ≥ required_by** (zero/negative buffer) |
| 68 | +4. Update procurement item with ETA and status |
| 69 | + |
| 70 | +**Conflict Logic:** |
| 71 | +- Flag if ETA >= required_by (arriving on or after deadline) |
| 72 | +- Don't flag if ETA < required_by (arriving before deadline is OK) |
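The rule above reduces to a single date comparison. A minimal sketch, assuming both values are available as `datetime.date` objects (`should_flag` is a hypothetical helper name):

```python
from datetime import date


def should_flag(eta: date, required_by: date) -> bool:
    """Flag when the shipment arrives on or after the deadline."""
    return eta >= required_by


# Boundary behaviour: arriving on the deadline flags, one day early does not.
assert should_flag(date(2026, 2, 15), date(2026, 2, 15))
assert not should_flag(date(2026, 2, 14), date(2026, 2, 15))
```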
| 73 | + |
| 74 | +| Test ID | Item | ETA | Required By | Should Flag? | Rationale | |
| 75 | +|---------|------|-----|-------------|--------------|-----------| |
| 76 | +| `departed_01_no_flag` | Steel Beams | 2026-02-10 | 2026-02-15 | **NO** | 5 days early - well within buffer | |
| 77 | +| `departed_02_no_flag` | Steel Beams | 2026-02-14 | 2026-02-15 | **NO** | 1 day early - still OK | |
| 78 | +| `departed_03_flag` | Steel Beams | 2026-02-15 | 2026-02-15 | **YES** | Arrives ON deadline - zero buffer | |
| 79 | +| `departed_04_flag` | Steel Beams | 2026-02-20 | 2026-02-15 | **YES** | 5 days LATE - definite conflict | |
| 80 | +| `departed_05_no_flag` | Windows | 2026-03-05 | 2026-03-15 | **NO** | 10 days early - uses buffer but OK | |
| 81 | +| `departed_06_no_flag` | HVAC Units | 2026-02-28 | 2026-03-01 | **NO** | 1 day early - boundary case, still OK | |
| 82 | + |
| 83 | +**Grading:** |
| 84 | +- Always required: `update_procurement_item_tool` |
| 85 | +- If ETA >= required_by: also require `flag_potential_issue` |
| 86 | +- If ETA < required_by: `flag_potential_issue` is **forbidden** (catches false positives) |
| 87 | +- DB: procurement item has correct ETA and status |
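The tool-call part of these grading rules could be derived from the two dates rather than hand-written per test. The sketch below is one possible shape; `ToolExpectation` and `departure_expectation` are hypothetical names, while the tool names come from the rules above.

```python
from datetime import date
from typing import NamedTuple


class ToolExpectation(NamedTuple):
    required: frozenset
    forbidden: frozenset


def departure_expectation(eta: date, required_by: date) -> ToolExpectation:
    """Build required/forbidden tool sets for a Shipment_Departed_Factory case."""
    required = {"update_procurement_item_tool"}  # always required
    forbidden = set()
    if eta >= required_by:
        # On or after the deadline: the agent must raise a flag.
        required.add("flag_potential_issue")
    else:
        # Before the deadline: a flag would be a false positive.
        forbidden.add("flag_potential_issue")
    return ToolExpectation(frozenset(required), frozenset(forbidden))
```

Deriving the expectation from the dates keeps the test table and the grader from drifting apart when cases are added.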
| 88 | + |
| 89 | +--- |
| 90 | + |
| 91 | +### Shipment_Arrived_Site |
| 92 | + |
| 93 | +**What should happen:** |
| 94 | +1. Agent wakes up on arrival event |
| 95 | +2. Notifies receiving team |
| 96 | +3. Schedules inspection |
| 97 | +4. Updates procurement item with arrival date and status |
| 98 | + |
| 99 | +| Test ID | Item | Prerequisite State | Expected Tool Calls | Expected DB State | |
| 100 | +|---------|------|-------------------|---------------------|-------------------| |
| 101 | +| `arrived_01` | Steel Beams | `shipment_departed` | `notify_team_shipment_arrived`, `schedule_inspection`, `update_procurement_item_tool` | `status="shipment_arrived"`, `date_arrived` set | |
| 102 | +| `arrived_02` | Windows | `shipment_departed` | Same | Same | |
| 103 | + |
| 104 | +**Grading:** |
| 105 | +- Required tools: `notify_team_shipment_arrived`, `schedule_inspection`, `update_procurement_item_tool` |
| 106 | +- DB: procurement item has status and arrival date set |
| 107 | + |
| 108 | +--- |
| 109 | + |
| 110 | +### Inspection_Failed (Human-in-the-Loop) |
| 111 | + |
| 112 | +**What should happen:** |
| 113 | +1. Agent wakes up, analyzes failure |
| 114 | +2. Escalates to human with recommended action (`wait_for_human` tool) |
| 115 | +3. Pauses workflow until human responds |
| 116 | +4. On human response, executes appropriate actions |
| 117 | + |
| 118 | +**Simulating Human Input:** |
| 119 | +Send human responses via the `RECEIVE_EVENT` signal (same path as UI), which populates `human_queue`. |
| 120 | + |
| 121 | +| Test ID | Scenario | Human Response | Expected Tool Calls | Expected DB State | |
| 122 | +|---------|----------|----------------|---------------------|-------------------| |
| 123 | +| `failed_01_approve` | Human approves recommendation | `"Yes"` | `wait_for_human`, then agent's recommended tools | Status updated per recommendation | |
| 124 | +| `failed_02_approve_plus` | Human approves + extra action | `"Yes, and also update the delivery date to 2026-03-15"` | `wait_for_human`, `update_delivery_date_tool`, `update_procurement_item_tool` | Status updated + schedule delivery date changed | |
| 125 | +| `failed_03_reject_delete` | Human rejects, wants deletion | `"No, remove it from the master schedule entirely"` | `wait_for_human`, `remove_delivery_item_tool`, `delete_procurement_item_tool` | Item removed from schedule, procurement item deleted | |
| 126 | + |
| 127 | +**Test Flow:** |
| 128 | +1. Setup: ensure item exists in DB (from prior events) |
| 129 | +2. Send inspection failed event |
| 130 | +3. Wait for agent to call `wait_for_human` (poll workflow state) |
| 131 | +4. Send human response via signal |
| 132 | +5. Wait for agent to complete processing |
| 133 | +6. Grade: check transcript + DB |
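Steps 2 through 5 of this flow could be sequenced roughly as follows. This is a sketch under several assumptions: `handle` is any object exposing an async `signal(name, payload)` method (the shape is assumed here and is not taken from the design), the event signal name is passed in because the design does not state it, the human reply is assumed to be sent as a plain string, and the two `wait_for_*` callables are hypothetical hooks that poll workflow state.

```python
import asyncio
from typing import Any, Awaitable, Callable, Protocol


class SignalTarget(Protocol):
    """Anything with an async `signal(name, payload)` method (assumed shape)."""

    async def signal(self, name: str, payload: Any) -> None: ...


async def run_human_in_the_loop_case(
    handle: SignalTarget,
    event_signal: str,
    failed_event: Any,
    human_response: str,
    wait_for_pause: Callable[[], Awaitable[None]],
    wait_for_done: Callable[[], Awaitable[None]],
    timeout_s: float = 120.0,
) -> None:
    # Step 2: send the inspection-failed event.
    await handle.signal(event_signal, failed_event)
    # Step 3: block until the agent has called wait_for_human.
    await asyncio.wait_for(wait_for_pause(), timeout=timeout_s)
    # Step 4: send the human reply over the same signal path as the UI.
    await handle.signal("RECEIVE_EVENT", human_response)
    # Step 5: wait for the agent to finish acting on the reply.
    await asyncio.wait_for(wait_for_done(), timeout=timeout_s)
```

Waiting for the pause before sending the reply matters: if the response arrives before the agent calls `wait_for_human`, the test no longer exercises the escalation path it is meant to grade.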
| 134 | + |
| 135 | +--- |
| 136 | + |
| 137 | +### Inspection_Passed |
| 138 | + |
| 139 | +**What should happen:** |
| 140 | +1. Agent wakes up on event |
| 141 | +2. Updates procurement item status to `inspection_passed` |
| 142 | +3. No escalation needed |
| 143 | + |
| 144 | +| Test ID | Item | Prerequisite State | Expected Tool Calls | Expected DB State | |
| 145 | +|---------|------|-------------------|---------------------|-------------------| |
| 146 | +| `passed_01` | Steel Beams | `shipment_arrived` | `update_procurement_item_tool` | `status="inspection_passed"` | |
| 147 | +| `passed_02` | Windows | `shipment_arrived` | Same | Same | |
| 148 | + |
| 149 | +**Grading:** |
| 150 | +- Required tools: `update_procurement_item_tool` |
| 151 | +- Forbidden tools: `wait_for_human`, `flag_potential_issue` (should NOT escalate on success) |
| 152 | +- DB: procurement item has correct status |
| 153 | + |
| 154 | +--- |
| 155 | + |
| 156 | +## Summary |
| 157 | + |
| 158 | +| Event Type | # Tests | Key Focus | |
| 159 | +|------------|---------|-----------| |
| 160 | +| Submittal_Approved | 2 | PO issued, DB entry created | |
| 161 | +| Shipment_Departed_Factory | 6 | **False positive detection** - forbidden tool calls | |
| 162 | +| Shipment_Arrived_Site | 2 | Team notified, inspection scheduled | |
| 163 | +| Inspection_Failed | 3 | Human-in-the-loop scenarios | |
| 164 | +| Inspection_Passed | 2 | Simple DB update, no escalation | |
| 165 | +| **Total** | **15** | | |
| 166 | + |
| 167 | +## Grader Approach |
| 168 | + |
| 169 | +- **Code-based only** - no LLM judges needed |
| 170 | +- Required tool calls (must appear in transcript) |
| 171 | +- Forbidden tool calls (must NOT appear) - catches false positives |
| 172 | +- DB state assertions scoped to `workflow_id` |
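The required/forbidden checks listed above could be implemented with plain set operations. A minimal sketch, assuming the transcript has already been reduced to a list of tool names (`grade_tool_calls` is a hypothetical function name, not necessarily what `graders/tool_calls.py` exposes):

```python
from typing import Iterable


def grade_tool_calls(
    transcript: Iterable[str],
    required: Iterable[str] = (),
    forbidden: Iterable[str] = (),
) -> list[str]:
    """Return failure messages; call order and repeat counts are ignored."""
    called = set(transcript)
    failures = [
        f"missing required tool call: {tool}"
        for tool in sorted(set(required) - called)
    ]
    failures += [
        f"forbidden tool call present: {tool}"
        for tool in sorted(set(forbidden) & called)
    ]
    return failures
```

For an Inspection_Passed case, a test might assert that `grade_tool_calls(calls, required=["update_procurement_item_tool"], forbidden=["wait_for_human", "flag_potential_issue"])` returns an empty list.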
| 173 | + |
| 174 | +## Key Design Decisions |
| 175 | + |
| 176 | +1. **Grade outcomes, not paths** - Don't require a specific tool call order; just verify that the required tools were called and the DB state is correct |
| 177 | +2. **Forbidden tools catch regressions** - Explicitly assert certain tools should NOT be called in specific scenarios |
| 178 | +3. **Isolated tests** - Each test gets a unique `workflow_id`, no shared state |
| 179 | +4. **Human simulation** - Use same signal path as UI to simulate human responses |