Commit f963c13

First version of git-evals2
1 parent d0b8570 commit f963c13

8 files changed: +682 -7 lines changed

evals/git-evals/eval-codebuff2.json

Lines changed: 1 addition & 6 deletions
Large diffs are not rendered by default.

evals/git-evals/logs/codebuff-yw_Q5Gr1Tls/eval-commit-212590d.json

Lines changed: 0 additions & 1 deletion
This file was deleted.

evals/git-evals2/README.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
# git-evals2

A simplified evaluation system for comparing Codebuff agents on git commit tasks.

## Overview

git-evals2 is a streamlined rewrite of the original git-evals system, inspired by the subagents evals (eval-planner and test-repo-utils). It focuses on simplicity and ease of use while maintaining the core functionality of agent evaluation.

## Key Simplifications

Compared to the original git-evals:

- **No child processes**: Runs every eval in-process with async/await instead of forking a process per eval
- **No prompting agent**: Single-shot execution; the agent gets the spec once and runs until done
- **Codebuff agents only**: Uses the SDK client exclusively (no Claude runner)
- **No trace in judging**: The judge only sees the final file changes vs. ground truth (not the agent's execution steps)
- **Function-based API**: A simple exported function instead of a CLI with complex process management
- **Minimal metadata**: Only tracks essential metrics (diff, duration, cost, optional error)

## Usage

```typescript
import { runGitEvals2 } from './evals/git-evals2/run-git-evals2'

const results = await runGitEvals2({
  evalDataPath: 'evals/git-evals/eval-codebuff2.json',
  agents: ['base', 'base-lite'],
  outputPath: 'evals/git-evals2/results.json',
  limit: 5,
  onProgress: (event) => {
    if (event.type === 'agent_complete') {
      console.log(`${event.agent} completed with score ${event.score}`)
    }
  },
})

console.log('Average scores:', {
  base: results.agents.get('base')?.averageScore,
  'base-lite': results.agents.get('base-lite')?.averageScore,
})
```

## API

### `runGitEvals2(options: GitEvals2Options): Promise<GitEvals2Result>`

#### Options

- `evalDataPath` (string): Path to the eval JSON file containing the commits
- `agents` (string[]): Array of agent IDs to compare (e.g., `['base', 'base-lite']`)
- `outputPath?` (string): Optional path to write the results JSON to
- `limit?` (number): Optional maximum number of commits to evaluate
- `onProgress?` (callback): Optional progress event handler
- `client?` (CodebuffClient): Optional SDK client override (useful for testing)

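The options above could be typed roughly as follows. This is a sketch inferred from the option list and from the `onProgress` handlers shown in this commit; it is not the content of `types.ts`, which is not rendered here. `ProgressEvent`, `GitEvals2OptionsSketch`, and `formatProgress` are hypothetical names introduced for illustration.

```typescript
// Hypothetical event shape, inferred from the fields the example handlers read.
type ProgressEvent =
  | { type: 'agent_start'; agent: string; commit: string }
  | { type: 'agent_complete'; agent: string; score: number }
  | { type: 'agent_error'; agent: string; error: string }

// Assumed option shape mirroring the list above (`client` omitted for brevity).
interface GitEvals2OptionsSketch {
  evalDataPath: string
  agents: string[]
  outputPath?: string
  limit?: number
  onProgress?: (event: ProgressEvent) => void
}

// Turns a progress event into a single log line.
function formatProgress(event: ProgressEvent): string {
  switch (event.type) {
    case 'agent_start':
      return `[${event.agent}] starting on commit ${event.commit.slice(0, 7)}`
    case 'agent_complete':
      return `[${event.agent}] completed with score ${event.score.toFixed(1)}/10`
    case 'agent_error':
      return `[${event.agent}] error: ${event.error}`
  }
}

const options: GitEvals2OptionsSketch = {
  evalDataPath: 'evals/git-evals/eval-codebuff2.json',
  agents: ['base', 'base-lite'],
  limit: 5,
  onProgress: (event) => console.log(formatProgress(event)),
}

// Simulate one event flowing through the handler.
options.onProgress?.({ type: 'agent_start', agent: 'base', commit: 'f963c13abcdef' })
```

Modeling the events as a discriminated union lets the compiler narrow `event` inside each branch, so a handler cannot read `score` on an error event by accident.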
#### Result

```typescript
interface GitEvals2Result {
  agents: Map<string, AgentEvalResults>
  timestamp: string
  totalDuration: number
}

interface AgentEvalResults {
  agentId: string
  runs: EvalRun[]
  averageScore: number
  averageCost: number
  averageDuration: number
}

interface EvalRun {
  commitSha: string
  spec: string
  diff: string
  judgeScore: number
  judgeFeedback: string
  cost: number
  durationMs: number
  error?: string
}
```

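One plausible way the per-agent averages relate to the individual runs is a plain arithmetic mean. The sketch below is an assumption, not the code in `run-git-evals2.ts` (which is not shown here): `RunMetrics`, `mean`, and `summarizeRuns` are hypothetical names, and whether errored runs count toward the averages is a guess.

```typescript
// Subset of the EvalRun fields that the averages depend on.
interface RunMetrics {
  judgeScore: number
  cost: number
  durationMs: number
  error?: string
}

// Arithmetic mean; returns 0 for an empty list to avoid NaN.
function mean(values: number[]): number {
  if (values.length === 0) return 0
  return values.reduce((sum, v) => sum + v, 0) / values.length
}

// Assumption: every run counts toward the averages, including errored ones.
function summarizeRuns(agentId: string, runs: RunMetrics[]) {
  return {
    agentId,
    runs,
    averageScore: mean(runs.map((r) => r.judgeScore)),
    averageCost: mean(runs.map((r) => r.cost)),
    averageDuration: mean(runs.map((r) => r.durationMs)),
  }
}

const summary = summarizeRuns('base', [
  { judgeScore: 6, cost: 0.12, durationMs: 40_000 },
  { judgeScore: 8, cost: 0.08, durationMs: 60_000 },
])
console.log(summary.averageScore, summary.averageDuration) // 7 50000
```

If errored runs should not drag the averages down, filtering with `runs.filter((r) => !r.error)` before averaging would be the alternative design.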
## How It Differs

### Architecture

- **Original**: Forks a child process for each eval and coordinates them over complex IPC
- **git-evals2**: Simple async functions with `Promise.all` for parallelism

### Execution

- **Original**: Multi-turn conversations, with a prompting agent deciding whether to continue, complete, or halt
- **git-evals2**: Single-shot; the agent gets the spec and runs until done or timeout

### Judging

- **Original**: Judge sees spec + agent trace + final diff; 3 judges with median selection
- **git-evals2**: Judge only sees spec + final diff (no trace); single judge call

### State Management

- **Original**: Complex SessionState threading, manual state updates
- **git-evals2**: SDK handles state internally, minimal metadata tracking

### Error Handling

- **Original**: Process-level handlers, signal management, cleanup logic
- **git-evals2**: Standard try-catch; continues on errors and records them in the results

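The combination of `Promise.all` parallelism and "continue on errors" described above can be sketched as follows. This is a minimal illustration under assumptions, not the orchestration code from this commit: `runAllAgents`, `TaskResult`, and the stand-in task are hypothetical.

```typescript
interface TaskResult<T> {
  agentId: string
  value?: T
  error?: string
}

// Runs one task per agent concurrently. A failure is recorded on that agent's
// result instead of rejecting the whole batch.
async function runAllAgents<T>(
  agentIds: string[],
  runOne: (agentId: string) => Promise<T>,
): Promise<TaskResult<T>[]> {
  return Promise.all(
    agentIds.map(async (agentId): Promise<TaskResult<T>> => {
      try {
        return { agentId, value: await runOne(agentId) }
      } catch (e) {
        return { agentId, error: e instanceof Error ? e.message : String(e) }
      }
    }),
  )
}

// Usage with a stand-in task: 'broken' throws, 'base' succeeds.
runAllAgents(['base', 'broken'], async (agentId) => {
  if (agentId === 'broken') throw new Error('agent failed')
  return agentId.length
}).then((results) => console.log(results))
```

Catching inside each mapped task matters: a bare `Promise.all` rejects as soon as any task rejects, which would discard the results of the agents that succeeded.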
## Module Structure

- `run-git-evals2.ts`: Main orchestration function
- `agent-runner.ts`: Executes a single agent on a commit
- `judge.ts`: Judges file changes without the trace
- `types.ts`: Type definitions
- `example.ts`: Example usage

## Benefits

- **Simpler codebase**: ~90% less code than the original system
- **Faster execution**: Less overhead from process management
- **Easier debugging**: Everything in-process with standard async/await
- **More maintainable**: Clear separation of concerns, modular design
- **Still powerful**: Maintains core evaluation functionality

evals/git-evals2/agent-runner.ts

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
import { execSync } from 'child_process'
import path from 'path'

import { loadLocalAgents } from '@codebuff/npm-app/agents/load-agents'
import { CodebuffClient } from '../../sdk/src/client'
import { withTestRepo } from '../subagents/test-repo-utils'

import type { EvalCommit } from './types'

export interface AgentRunResult {
  diff: string
  durationMs: number
  cost: number
  error?: string
}

export async function runAgentOnCommit({
  client,
  agentId,
  commit,
  repoUrl,
  initCommand,
}: {
  client: CodebuffClient
  agentId: string
  commit: EvalCommit
  repoUrl: string
  initCommand?: string
}): Promise<AgentRunResult> {
  const startTime = Date.now()
  let diff = ''
  let error: string | undefined
  let cost = 0

  try {
    await withTestRepo(
      {
        repoUrl,
        commitSha: commit.sha,
        initCommand,
        checkoutPrevious: true,
      },
      async (repoDir) => {
        const agentsPath = path.join(__dirname, '../../.agents')
        const localAgentDefinitions = Object.values(
          await loadLocalAgents({ agentsPath }),
        )

        // Single-shot run: the agent receives the spec once as its prompt.
        const result = await client.run({
          agent: agentId,
          prompt: commit.spec,
          agentDefinitions: localAgentDefinitions,
          cwd: repoDir,
        })

        // Credits are divided by 100 to report cost in dollars.
        cost = result.sessionState.mainAgentState.creditsUsed / 100

        // Stage everything first so new (untracked) files show up in the diff against HEAD.
        execSync('git add .', { cwd: repoDir, stdio: 'ignore' })
        diff = execSync('git diff HEAD', {
          cwd: repoDir,
          encoding: 'utf-8',
        })
      },
    )
  } catch (e) {
    // Record the failure instead of rethrowing so the caller can continue.
    error = e instanceof Error ? `${e.message}\n${e.stack}` : String(e)
  }

  const durationMs = Date.now() - startTime

  return {
    diff,
    durationMs,
    cost,
    error,
  }
}

evals/git-evals2/example.ts

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
import path from 'path'
import { runGitEvals2 } from './run-git-evals2'

async function main() {
  console.log('Running git-evals2 example...')
  console.log('Comparing base and base-lite agents on first 3 commits\n')

  const results = await runGitEvals2({
    evalDataPath: path.join(__dirname, '../git-evals/eval-codebuff2.json'),
    agents: ['base', 'base-lite'],
    outputPath: path.join(__dirname, '../git-evals2/example-results.json'),
    limit: 3,
    onProgress: (event) => {
      if (event.type === 'agent_start') {
        console.log(
          `[${event.agent}] Starting on commit ${event.commit.slice(0, 7)}...`,
        )
      } else if (event.type === 'agent_complete') {
        console.log(
          `[${event.agent}] ✓ Completed with score ${event.score.toFixed(1)}/10`,
        )
      } else if (event.type === 'agent_error') {
        console.log(`[${event.agent}] ✗ Error: ${event.error}`)
      }
    },
  })

  console.log('\n=== Final Results ===')
  console.log(`Total duration: ${(results.totalDuration / 1000).toFixed(1)}s\n`)

  for (const [agentId, data] of results.agents) {
    console.log(`${agentId}:`)
    console.log(`  Score: ${data.averageScore.toFixed(2)}/10`)
    console.log(`  Cost: $${data.averageCost.toFixed(4)}`)
    console.log(`  Duration: ${(data.averageDuration / 1000).toFixed(1)}s`)
    console.log(
      `  Success: ${data.runs.filter((r) => !r.error).length}/${data.runs.length}`,
    )
    console.log()
  }
}

if (import.meta.main) {
  main().catch((error) => {
    console.error('Error running example:', error)
    process.exit(1)
  })
}
