Commit 337156e
Update trace analyzer to focus on agent process
1 parent: 815129f

File tree: 2 files changed (+54, -33 lines)


evals/git-evals2/run-git-evals2.ts

Lines changed: 7 additions & 0 deletions
@@ -201,6 +201,13 @@ export async function runGitEvals2(options: {
     spec: commit.spec,
     timestamp: new Date().toISOString(),
     analysis,
+    results: commitTraces.map((t) => ({
+      agentId: t.agentId,
+      ...t.judgeResult,
+      cost: t.cost,
+      durationMs: t.durationMs,
+      error: t.error,
+    })),
   }
 
   fs.writeFileSync(analysisPath, JSON.stringify(analysisData, null, 2))
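To see what this hunk produces: the new `results` field flattens each trace into one entry per agent, spreading the judge's scores alongside cost, duration, and error. A minimal self-contained sketch with mock data (the `Trace` shape and values below are illustrative assumptions, not the repo's actual types):

```typescript
// Hypothetical minimal shape of one agent trace; the real type in the
// repo likely carries more fields (trace steps, diffs, etc.).
type Trace = {
  agentId: string
  judgeResult: { completionScore: number; overallScore: number }
  cost: number
  durationMs: number
  error?: string
}

const commitTraces: Trace[] = [
  {
    agentId: 'agent-a',
    judgeResult: { completionScore: 0.9, overallScore: 0.85 },
    cost: 0.12,
    durationMs: 45000,
  },
]

// Same mapping as in the diff above: spread the judge result so its
// scores sit at the top level of each result entry.
const results = commitTraces.map((t) => ({
  agentId: t.agentId,
  ...t.judgeResult,
  cost: t.cost,
  durationMs: t.durationMs,
  error: t.error,
}))

console.log(JSON.stringify(results, null, 2))
```

Note that `JSON.stringify` drops the `error` key entirely when it is `undefined`, which keeps the written analysis file compact for successful runs.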

evals/git-evals2/trace-analyzer.ts

Lines changed: 47 additions & 33 deletions
@@ -161,32 +161,39 @@ const traceAnalyzerAgent: AgentDefinition = {
     },
     required: ['overallAnalysis', 'agentFeedback', 'recommendations'],
   },
-  systemPrompt: `You are an expert AI agent evaluator comparing multiple coding agents on the same task.
+  systemPrompt: `You are an expert AI agent evaluator analyzing how different coding agents approach problems and make decisions.
 
 ## Your Role
 
 You will receive:
-1. A task specification
-2. Full traces from each agent showing their approach and execution
-3. Results including:
-   - Judge results (completion score, code quality score, overall score, analysis, strengths, weaknesses)
-   - Cost efficiency
-   - Time efficiency
-   - Whether they produced valid diffs
-   - Any errors encountered
-   - Number of trace steps taken
-
-## Analysis Criteria
+1. A task specification (for context only)
+2. Full traces from each agent showing their step-by-step process
+3. Performance metrics (scores, cost, time, errors)
+
+## Focus on Agent Processes
+
+Your analysis should focus on how agents work, not what they accomplished:
+
+Key Analysis Areas:
+- Problem-Solving Approach: How did each agent break down and approach the problem?
+- Tool Usage Patterns: Which tools did they use, in what sequence, and why?
+- Decision-Making Strategy: What information did they gather before acting? How did they validate assumptions?
+- Workflow Efficiency: Did they follow a systematic process or jump around? Were steps logically ordered?
+- Context Gathering: How thoroughly did they explore the codebase before making changes?
+- Iterative Refinement: Did they test, verify, or refine their work? How?
+
+## Output Format
 
 Provide:
-- **Overall Analysis**: Compare how agents performed on this task, analyzing their different approaches
-- **Agent Feedback**: For each agent, list:
-  - Strengths: What this agent did well (specific actions from trace)
-  - Weaknesses: What this agent struggled with (specific issues from trace)
-  - Relative Performance: How this agent compared to others
-- **Recommendations**: Actionable suggestions for improving the agents based on observed behavior
-
-Focus on comparative insights - how agents differ in their approaches, tool usage patterns, efficiency, and results.
+- Overall Analysis: Compare agent workflows, highlighting different process strategies
+- Agent Feedback: For each agent:
+  - Strengths: Process steps that worked well (e.g., thoroughly explored codebase before editing)
+  - Weaknesses: Process gaps or inefficiencies (e.g., made changes without reading related files)
+  - Relative Performance: How this agent's process compared to others
+- Recommendations: Generalizable improvements to agent workflows and decision-making processes
+
+Important: Focus on the agent's process and methodology, not on the object-level content of the code changes. We want to understand how to improve the agent's approach to any problem.
 
 Note: read_files tool results show [TRUNCATED] for file contents to save space.`,
 }
 
@@ -208,24 +215,31 @@ export async function analyzeAgentTraces({
     error: t.error,
   }))
 
-  const prompt = `## Task Specification
+  const prompt = `## Task Specification (for context)
 ${spec}
 
 ## Agent Traces and Results
 ${JSON.stringify(truncatedTraces, null, 2)}
 
-Please compare these agents and provide:
-1. An overall analysis of how the agents performed, including differences in their approaches
-2. Specific feedback for each agent including strengths, weaknesses, and how they performed relative to others
-3. Recommendations for improving the agents
-
-Focus on:
-- Judge results (completion score, code quality score, overall score, analysis, strengths, weaknesses)
-- Approach and tool usage patterns from the traces
-- Cost efficiency
-- Time efficiency
-- Whether they produced valid diffs
-- Any errors encountered`
+Analyze how these agents approached the problem, focusing on their processes and workflows rather than the specific task:
+
+1. Overall Process Comparison: How did agents differ in their problem-solving approach?
+   - What was their overall strategy/workflow?
+   - How did they sequence their actions?
+   - What patterns emerged in how they gathered context vs. taking action?
+
+2. Per-Agent Process Analysis: For each agent, identify:
+   - Process strengths: What systematic steps or decisions worked well?
+   - Process weaknesses: Where did their workflow have gaps or inefficiencies?
+   - Key differences: How did this agent's process differ from others?
+
+3. Generalizable Recommendations: Suggest improvements to agent workflows that would help on any task:
+   - Better context-gathering strategies
+   - More effective tool usage patterns
+   - Improved decision-making processes
+   - Workflow optimizations
+
+Focus on the HOW, not the WHAT: We want to understand and improve how agents work, not evaluate their specific code output.`
 
   const analyzerResult = await client.run({
    agent: 'git-evals2-trace-analyzer',
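The system prompt's note that read_files results show [TRUNCATED] implies a preprocessing pass over the traces before they are serialized into the prompt. A hypothetical sketch of such a pass (`TraceStep` and `truncateTraceStep` are invented names for illustration, not the repo's actual helpers):

```typescript
// Illustrative shape of one tool-call step in an agent trace.
type TraceStep = {
  toolName: string
  output: string
}

// Replace bulky read_files output with a placeholder so the analyzer
// prompt stays small; all other tool outputs pass through unchanged.
function truncateTraceStep(step: TraceStep): TraceStep {
  if (step.toolName === 'read_files') {
    return { ...step, output: '[TRUNCATED]' }
  }
  return step
}

const steps: TraceStep[] = [
  { toolName: 'read_files', output: 'very long file contents...' },
  { toolName: 'run_terminal_command', output: 'ok' },
]

const truncated = steps.map(truncateTraceStep)
console.log(truncated.map((s) => s.output).join(',')) // prints: [TRUNCATED],ok
```

Truncating only the file-read payloads keeps the process signal the analyzer cares about (which tools ran, in what order) while dropping the object-level content it is told to ignore.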
