
Commit c6cd5c3

Chibi Vikram and claude committed
fix: legacy evaluation reporting with Strategy Pattern
This PR fixes legacy evaluation reporting to the backend that was returning HTTP 400 errors and implements the Strategy Pattern for cleaner code separation.

## Changes

### Strategy Pattern Implementation

- Created `EvalReportingStrategy` Protocol defining the interface for evaluation reporting strategies
- Implemented `LegacyEvalReportingStrategy` for legacy evaluations:
  - Converts string IDs to deterministic GUIDs using uuid5
  - Uses endpoints without /coded/ prefix
  - Uses assertionRuns format with assertionSnapshot
- Implemented `CodedEvalReportingStrategy` for coded evaluations:
  - Keeps IDs as strings
  - Uses /coded/ endpoint prefix
  - Uses evaluatorRuns format with evaluationCriterias

### Bug Fixes

- Fixed legacy eval API payload structure for backend compatibility
- Added type assertion for project_id to fix mypy errors
- Removed unused ABC, abstractmethod imports after Protocol migration

### Test Results

- All 27 unit tests passing
- All linting checks (ruff, mypy) passing
- Integration testing with calculator sample: all API calls returning HTTP 200 OK

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
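The split described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: only the three class names, the use of uuid5, the `/coded/` prefix, and the `assertionRuns`/`assertionSnapshot` and `evaluatorRuns`/`evaluationCriterias` keys come from the message. The method names, the `ID_NAMESPACE` choice, the GUID pass-through, the payload field layout, and `pick_strategy` are assumptions made for the example.

```python
import uuid
from typing import Any, Protocol

# Assumption: the commit does not say which uuid5 namespace is used.
ID_NAMESPACE = uuid.NAMESPACE_URL


class EvalReportingStrategy(Protocol):
    """Hypothetical interface; method names are illustrative."""

    def convert_id(self, raw_id: str) -> str: ...
    def endpoint(self, path: str) -> str: ...
    def runs_payload(self, evaluator_id: str, score: float) -> dict[str, Any]: ...


class LegacyEvalReportingStrategy:
    def convert_id(self, raw_id: str) -> str:
        # Assumption: values that already parse as GUIDs pass through;
        # any other string is mapped to a deterministic uuid5 GUID.
        try:
            return str(uuid.UUID(raw_id))
        except ValueError:
            return str(uuid.uuid5(ID_NAMESPACE, raw_id))

    def endpoint(self, path: str) -> str:
        # Legacy endpoints carry no /coded/ prefix.
        return f"/{path}"

    def runs_payload(self, evaluator_id: str, score: float) -> dict[str, Any]:
        # Field layout inside each run is assumed for illustration.
        run = {
            "evaluatorId": self.convert_id(evaluator_id),
            "assertionSnapshot": {"score": score},
        }
        return {"assertionRuns": [run]}


class CodedEvalReportingStrategy:
    def convert_id(self, raw_id: str) -> str:
        # Coded evaluations keep IDs as plain strings.
        return raw_id

    def endpoint(self, path: str) -> str:
        return f"/coded/{path}"

    def runs_payload(self, evaluator_id: str, score: float) -> dict[str, Any]:
        # Field layout inside each run is assumed for illustration.
        run = {"evaluatorId": evaluator_id, "evaluationCriterias": {"score": score}}
        return {"evaluatorRuns": [run]}


def pick_strategy(is_coded: bool) -> EvalReportingStrategy:
    """Hypothetical selector; both classes satisfy the Protocol structurally."""
    return CodedEvalReportingStrategy() if is_coded else LegacyEvalReportingStrategy()
```

Because `Protocol` relies on structural typing, neither strategy needs to inherit from a base class, which is consistent with the removal of the `ABC` and `abstractmethod` imports noted under Bug Fixes.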
1 parent d546ef4 commit c6cd5c3


7 files changed: +844 -805 lines changed


samples/calculator/evaluations/eval-sets/legacy.json

Lines changed: 13 additions & 13 deletions
@@ -1,17 +1,17 @@
 {
-  "fileName": "default.json",
-  "id": "default-eval-set-id",
-  "name": "Basic Calculator Evaluation Set",
+  "fileName": "legacy.json",
+  "id": "a1b2c3d4-e5f6-4a89-abcd-ef0123456789",
+  "name": "Basic Calculator Evaluation Set (Legacy)",
   "batchSize": 10,
   "evaluatorRefs": [
-    "equality",
-    "llm-as-a-judge",
-    "json-similarity",
-    "trajectory"
+    "aaaaaaaa-aaaa-4aaa-aaaa-aaaaaaaaaaaa",
+    "bbbbbbbb-bbbb-4bbb-bbbb-bbbbbbbbbbbb",
+    "cccccccc-cccc-4ccc-cccc-cccccccccccc",
+    "dddddddd-dddd-4ddd-dddd-dddddddddddd"
   ],
   "evaluations": [
     {
-      "id": "test-addition",
+      "id": "11111111-1111-4111-8111-111111111111",
       "name": "Test Addition",
       "inputs": {
         "a": 1,
@@ -22,12 +22,12 @@
         "result": 2.0
       },
       "expectedAgentBehavior": "The operation should produce the right output.",
-      "evalSetId": "default-eval-set-id",
+      "evalSetId": "a1b2c3d4-e5f6-4a89-abcd-ef0123456789",
       "createdAt": "2025-09-04T18:54:58.378Z",
       "updatedAt": "2025-09-04T18:55:55.416Z"
     },
     {
-      "id": "test-random-addition-using-llm",
+      "id": "22222222-2222-4222-8222-222222222222",
       "name": "Test Random Addition Using LLM",
       "inputs": {
         "a": 1,
@@ -45,12 +45,12 @@
           "name": "get_random_operator"
         }
       ],
-      "evalSetId": "default-eval-set-id",
+      "evalSetId": "a1b2c3d4-e5f6-4a89-abcd-ef0123456789",
       "createdAt": "2025-09-04T18:54:58.378Z",
       "updatedAt": "2025-09-04T18:55:55.416Z"
     },
     {
-      "id": "test-with-llm-input-mocking",
+      "id": "33333333-3333-4333-8333-333333333333",
       "name": "Test with LLM input mocking",
       "inputs": {},
       "expectedOutput": {
@@ -59,7 +59,7 @@
       "expectedAgentBehavior": "The operation should produce the right output.",
       "simulateInput": true,
       "inputGenerationInstructions": "Generate a multiplication calculation where the first number is 5 and the second number is 7",
-      "evalSetId": "default-eval-set-id",
+      "evalSetId": "a1b2c3d4-e5f6-4a89-abcd-ef0123456789",
       "createdAt": "2025-09-04T18:54:58.378Z",
       "updatedAt": "2025-09-04T18:55:55.416Z"
     }
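The ID changes in this file follow a simple pattern: the eval set `id`, every entry in `evaluatorRefs`, and every evaluation `id` become GUID-formatted strings, and each `evalSetId` matches the set's own `id`. A small check along these lines could catch a file that drifts from that pattern. It is a sketch only; `is_guid` and `check_eval_set` are hypothetical helpers, not part of the commit, and the check assumes all ID fields are strings.

```python
import uuid


def is_guid(value: str) -> bool:
    """Return True if value parses as a UUID (any version)."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


def check_eval_set(eval_set: dict) -> list[str]:
    """Collect human-readable problems found in a legacy eval set dict."""
    problems = []
    set_id = eval_set.get("id", "")
    if not is_guid(set_id):
        problems.append(f"eval set id is not a GUID: {set_id!r}")
    for ref in eval_set.get("evaluatorRefs", []):
        if not is_guid(ref):
            problems.append(f"evaluatorRef is not a GUID: {ref!r}")
    for evaluation in eval_set.get("evaluations", []):
        eval_id = evaluation.get("id", "")
        if not is_guid(eval_id):
            problems.append(f"evaluation id is not a GUID: {eval_id!r}")
        if evaluation.get("evalSetId") != set_id:
            problems.append(f"evalSetId mismatch in evaluation {eval_id!r}")
    return problems
```

Run against the pre-change values (`"default-eval-set-id"`, `"equality"`, `"test-addition"`), the check reports each non-GUID ID; run against the post-change values it returns an empty list.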

samples/calculator/evaluations/evaluators/legacy-equality.json

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 {
-  "fileName": "equality.json",
-  "id": "equality",
+  "fileName": "legacy-equality.json",
+  "id": "aaaaaaaa-aaaa-4aaa-aaaa-aaaaaaaaaaaa",
   "name": "Equality Evaluator",
   "description": "An evaluator that judges the agent based on expected output.",
   "category": 0,

samples/calculator/evaluations/evaluators/legacy-json-similarity.json

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 {
-  "fileName": "json-similarity.json",
-  "id": "json-similarity",
+  "fileName": "legacy-json-similarity.json",
+  "id": "cccccccc-cccc-4ccc-cccc-cccccccccccc",
   "name": "JSON Similarity Evaluator",
   "description": "An evaluator that compares JSON structures with tolerance for numeric and string differences.",
   "category": 0,

samples/calculator/evaluations/evaluators/legacy-llm-as-a-judge.json

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 {
-  "fileName": "llm-as-a-judge.json",
-  "id": "llm-as-a-judge",
+  "fileName": "legacy-llm-as-a-judge.json",
+  "id": "bbbbbbbb-bbbb-4bbb-bbbb-bbbbbbbbbbbb",
   "name": "LLMAsAJudge Evaluator",
   "description": "An evaluator that judges the agent based on it's run history and expected behavior",
   "category": 3,

samples/calculator/evaluations/evaluators/legacy-trajectory.json

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 {
-  "fileName": "trajectory.json",
-  "id": "trajectory",
+  "fileName": "legacy-trajectory.json",
+  "id": "dddddddd-dddd-4ddd-dddd-dddddddddddd",
   "name": "Trajectory Evaluator",
   "description": "An evaluator that analyzes the execution trajectory and decision sequence taken by the agent.",
   "category": 3,
