Commit 055f94c

Add evals
1 parent 7d2aeda commit 055f94c

22 files changed

+3564 -1 lines changed
Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
# Procurement Agent Eval Framework Design

## Overview

These integration tests run against a live Temporal workflow to verify that the procurement agent behaves correctly. Each test sends real events via signals, captures the tool call transcript, and verifies database state changes.

## Architecture

```
evals/
├── conftest.py                      # Pytest fixtures: workflow setup, DB helpers, Temporal client
├── graders/
│   ├── tool_calls.py                # Verify required/forbidden tool calls in transcript
│   └── database.py                  # Verify DB state changes
├── tasks/
│   ├── test_submittal_approved.py
│   ├── test_shipment_departed.py    # Multiple cases for false positive issue
│   ├── test_shipment_arrived.py
│   ├── test_inspection_failed.py    # Human-in-the-loop scenarios
│   └── test_inspection_passed.py
└── fixtures/
    └── events.py                    # Pre-built event payloads for each scenario
```

## Test Flow (per task)

1. Spin up fresh workflow via Temporal client
2. Send event signal(s) to workflow
3. Wait for agent to process (poll for completion or timeout)
4. **Capture transcript** - extract tool calls from workflow history
5. **Query DB** - check procurement_items and schedule tables
6. **Grade** - assert required tool calls present, forbidden calls absent, DB state correct

## Isolation

- Each test gets a unique `workflow_id`
- DB queries are scoped to that `workflow_id`
- No shared state between tests
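
A minimal sketch of how a per-test `workflow_id` could be generated. The `make_workflow_id` helper and the `eval-<test_id>-<suffix>` naming scheme are assumptions for illustration, not part of the committed framework; any scheme that never repeats across runs would serve the same purpose.

```python
import uuid


def make_workflow_id(test_id: str) -> str:
    """Return a workflow ID that is unique per test run.

    Hypothetical naming scheme: the random suffix keeps reruns of the
    same test from colliding with rows left by an earlier run.
    """
    return f"eval-{test_id}-{uuid.uuid4().hex[:12]}"


# Two runs of the same test get different IDs, so their DB rows never overlap.
first = make_workflow_id("submittal_01")
second = make_workflow_id("submittal_01")
```

Keeping the test ID inside the workflow ID also makes it easier to trace a failing run back to its test case when inspecting workflow history.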

---

## Test Cases

### Submittal_Approved

**What should happen:**
1. Agent wakes up on event
2. Issues purchase order (tool call)
3. Creates procurement item in DB with status + PO ID

| Test ID | Item | Expected Tool Calls | Expected DB State |
|---------|------|---------------------|-------------------|
| `submittal_01` | Steel Beams | `issue_purchase_order`, `create_procurement_item_tool` | `status="purchase_order_issued"`, `purchase_order_id` not null |
| `submittal_02` | HVAC Units | Same | Same |

**Grading:**
- Required tools: `issue_purchase_order`, `create_procurement_item_tool`
- DB: procurement item exists with correct status and PO ID
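
The DB check above could be sketched as follows. This uses an in-memory `sqlite3` database as a stand-in; the real database engine, the exact schema, and the `grade_submittal_db` helper are assumptions. Only the `procurement_items` table name and the `status`, `purchase_order_id`, and `workflow_id` fields come from this document.

```python
import sqlite3
from typing import List


def grade_submittal_db(conn: sqlite3.Connection, workflow_id: str) -> List[str]:
    """Return a list of failure messages; an empty list means the check passed."""
    row = conn.execute(
        "SELECT status, purchase_order_id FROM procurement_items "
        "WHERE workflow_id = ?",  # scoped to one workflow for isolation
        (workflow_id,),
    ).fetchone()
    if row is None:
        return ["no procurement item found for workflow"]
    status, po_id = row
    failures = []
    if status != "purchase_order_issued":
        failures.append(f"unexpected status: {status!r}")
    if po_id is None:
        failures.append("purchase_order_id is null")
    return failures


# Stand-in schema and data (assumed, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE procurement_items "
    "(workflow_id TEXT, item TEXT, status TEXT, purchase_order_id TEXT)"
)
conn.execute(
    "INSERT INTO procurement_items VALUES (?, ?, ?, ?)",
    ("wf-1", "Steel Beams", "purchase_order_issued", "PO-001"),
)
```

Returning failure messages instead of raising on the first mismatch lets a test report every problem with a row in one run.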

---

### Shipment_Departed_Factory (Critical - False Positive Issue)

**What should happen:**
1. Agent wakes up, ingests ETA
2. Cross-references with master schedule
3. **Only flag if ETA ≥ required_by** (zero/negative buffer)
4. Update procurement item with ETA and status

**Conflict Logic:**
- Flag if ETA >= required_by (arriving on or after deadline)
- Don't flag if ETA < required_by (arriving before deadline is OK)
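
The conflict rule reduces to a single date comparison. The sketch below is illustrative: the `should_flag` function and the `CASES` mapping are hypothetical names, and the dates mirror the test cases listed in this section.

```python
from datetime import date


def should_flag(eta: date, required_by: date) -> bool:
    """Flag when the shipment arrives on or after the required-by date."""
    return eta >= required_by


# (ETA, required_by, expected flag) for each test ID in this section.
CASES = {
    "departed_01_no_flag": (date(2026, 2, 10), date(2026, 2, 15), False),
    "departed_02_no_flag": (date(2026, 2, 14), date(2026, 2, 15), False),
    "departed_03_flag": (date(2026, 2, 15), date(2026, 2, 15), True),
    "departed_04_flag": (date(2026, 2, 20), date(2026, 2, 15), True),
    "departed_05_no_flag": (date(2026, 3, 5), date(2026, 3, 15), False),
    "departed_06_no_flag": (date(2026, 2, 28), date(2026, 3, 1), False),
}

results = {
    test_id: should_flag(eta, required_by)
    for test_id, (eta, required_by, _expected) in CASES.items()
}
```

Comparing `date` objects rather than strings avoids format-dependent ordering bugs, and the `>=` makes the zero-buffer case (`departed_03_flag`) a conflict.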

| Test ID | Item | ETA | Required By | Should Flag? | Rationale |
|---------|------|-----|-------------|--------------|-----------|
| `departed_01_no_flag` | Steel Beams | 2026-02-10 | 2026-02-15 | **NO** | 5 days early - well within buffer |
| `departed_02_no_flag` | Steel Beams | 2026-02-14 | 2026-02-15 | **NO** | 1 day early - still OK |
| `departed_03_flag` | Steel Beams | 2026-02-15 | 2026-02-15 | **YES** | Arrives ON deadline - zero buffer |
| `departed_04_flag` | Steel Beams | 2026-02-20 | 2026-02-15 | **YES** | 5 days LATE - definite conflict |
| `departed_05_no_flag` | Windows | 2026-03-05 | 2026-03-15 | **NO** | 10 days early - uses buffer but OK |
| `departed_06_no_flag` | HVAC Units | 2026-02-28 | 2026-03-01 | **NO** | 1 day early - boundary case, still OK |

**Grading:**
- Always required: `update_procurement_item_tool`
- If ETA >= required_by: also require `flag_potential_issue`
- If ETA < required_by: `flag_potential_issue` is **forbidden** (catches false positives)
- DB: procurement item has correct ETA and status

---

### Shipment_Arrived_Site

**What should happen:**
1. Agent wakes up on arrival event
2. Notifies receiving team
3. Schedules inspection
4. Updates procurement item with arrival date and status

| Test ID | Item | Prerequisite State | Expected Tool Calls | Expected DB State |
|---------|------|-------------------|---------------------|-------------------|
| `arrived_01` | Steel Beams | `shipment_departed` | `notify_team_shipment_arrived`, `schedule_inspection`, `update_procurement_item_tool` | `status="shipment_arrived"`, `date_arrived` set |
| `arrived_02` | Windows | `shipment_departed` | Same | Same |

**Grading:**
- Required tools: `notify_team_shipment_arrived`, `schedule_inspection`, `update_procurement_item_tool`
- DB: procurement item has status and arrival date set

---

### Inspection_Failed (Human-in-the-Loop)

**What should happen:**
1. Agent wakes up, analyzes failure
2. Escalates to human with recommended action (`wait_for_human` tool)
3. Pauses workflow until human responds
4. On human response, executes appropriate actions

**Simulating Human Input:**
Send human responses via the `RECEIVE_EVENT` signal (same path as UI), which populates `human_queue`.

| Test ID | Scenario | Human Response | Expected Tool Calls | Expected DB State |
|---------|----------|----------------|---------------------|-------------------|
| `failed_01_approve` | Human approves recommendation | `"Yes"` | `wait_for_human`, then agent's recommended tools | Status updated per recommendation |
| `failed_02_approve_plus` | Human approves + extra action | `"Yes, and also update the delivery date to 2026-03-15"` | `wait_for_human`, `update_delivery_date_tool`, `update_procurement_item_tool` | Status updated + schedule delivery date changed |
| `failed_03_reject_delete` | Human rejects, wants deletion | `"No, remove it from the master schedule entirely"` | `wait_for_human`, `remove_delivery_item_tool`, `delete_procurement_item_tool` | Item removed from schedule, procurement item deleted |

**Test Flow:**
1. Setup: ensure item exists in DB (from prior events)
2. Send inspection failed event
3. Wait for agent to call `wait_for_human` (poll workflow state)
4. Send human response via signal
5. Wait for agent to complete processing
6. Grade: check transcript + DB
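
Steps 3 and 5 both wait on a condition with a timeout. A minimal polling helper might look like the sketch below. The `wait_until` name and its parameters are hypothetical, and the predicate is a stand-in: how the real test reads workflow state (for example, through a workflow query) is not specified here.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "agent has called wait_for_human": true on the third poll.
polls = []


def fake_agent_is_waiting() -> bool:
    polls.append("poll")
    return len(polls) >= 3


reached = wait_until(fake_agent_is_waiting, timeout=2.0, interval=0.01)
timed_out = wait_until(lambda: False, timeout=0.05, interval=0.01)
```

Returning `False` on timeout, rather than raising, lets the test fail with a message that names the step that stalled.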

---

### Inspection_Passed

**What should happen:**
1. Agent wakes up on event
2. Updates procurement item status to passed/complete
3. No escalation needed

| Test ID | Item | Prerequisite State | Expected Tool Calls | Expected DB State |
|---------|------|-------------------|---------------------|-------------------|
| `passed_01` | Steel Beams | `shipment_arrived` | `update_procurement_item_tool` | `status="inspection_passed"` |
| `passed_02` | Windows | `shipment_arrived` | Same | Same |

**Grading:**
- Required tools: `update_procurement_item_tool`
- Forbidden tools: `wait_for_human`, `flag_potential_issue` (should NOT escalate on success)
- DB: procurement item has correct status

---

## Summary

| Event Type | # Tests | Key Focus |
|------------|---------|-----------|
| Submittal_Approved | 2 | PO issued, DB entry created |
| Shipment_Departed_Factory | 6 | **False positive detection** - forbidden tool calls |
| Shipment_Arrived_Site | 2 | Team notified, inspection scheduled |
| Inspection_Failed | 3 | Human-in-the-loop scenarios |
| Inspection_Passed | 2 | Simple DB update, no escalation |
| **Total** | **15** | |

## Grader Approach

- **Code-based only** - no LLM judges needed
- Required tool calls (must appear in transcript)
- Forbidden tool calls (must NOT appear) - catches false positives
- DB state assertions scoped to `workflow_id`
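
A minimal sketch of a required/forbidden tool call grader. It assumes the transcript has already been reduced to a list of tool names; the `ToolCallGrade` and `grade_tool_calls` names are hypothetical and may not match what `graders/tool_calls.py` actually contains.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ToolCallGrade:
    missing: List[str] = field(default_factory=list)         # required but never called
    forbidden_seen: List[str] = field(default_factory=list)  # forbidden but called

    @property
    def passed(self) -> bool:
        return not self.missing and not self.forbidden_seen


def grade_tool_calls(
    transcript: Iterable[str],
    required: Iterable[str],
    forbidden: Iterable[str] = (),
) -> ToolCallGrade:
    """Check set membership only; call order is deliberately ignored."""
    called = set(transcript)
    return ToolCallGrade(
        missing=[tool for tool in required if tool not in called],
        forbidden_seen=[tool for tool in forbidden if tool in called],
    )


# An early shipment that was correctly not flagged, and one false positive.
ok = grade_tool_calls(
    ["update_procurement_item_tool"],
    required=["update_procurement_item_tool"],
    forbidden=["flag_potential_issue"],
)
false_positive = grade_tool_calls(
    ["update_procurement_item_tool", "flag_potential_issue"],
    required=["update_procurement_item_tool"],
    forbidden=["flag_potential_issue"],
)
```

Reporting `missing` and `forbidden_seen` separately shows whether a failure came from an omitted action or from an unwanted one such as a false-positive flag.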

## Key Design Decisions

1. **Grade outcomes, not paths** - Don't require a specific tool call order; just verify that required tools were called and DB state is correct
2. **Forbidden tools catch regressions** - Explicitly assert that certain tools are NOT called in specific scenarios
3. **Isolated tests** - Each test gets a unique `workflow_id`, with no shared state
4. **Human simulation** - Use the same signal path as the UI to simulate human responses
