| 1 | +# git-evals2 |
| 2 | + |
| 3 | +A simplified evaluation system for comparing Codebuff agents on git commit tasks. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +git-evals2 is a streamlined rewrite of the original git-evals system, inspired by the subagents evals (eval-planner and test-repo-utils). It focuses on simplicity and ease of use while maintaining the core functionality of agent evaluation. |
| 8 | + |
| 9 | +## Key Simplifications |
| 10 | + |
| 11 | +Compared to the original git-evals: |
| 12 | + |
| 13 | +- **No child processes**: Runs everything in-process with async/await |
| 14 | +- **No prompting agent**: Single-shot execution: the agent gets the spec once and runs until it finishes or times out |
| 15 | +- **Codebuff agents only**: Uses the SDK client exclusively (no Claude runner) |
| 16 | +- **No trace in judging**: Judge only sees final file changes vs ground truth (not agent execution steps) |
| 17 | +- **Function-based API**: Simple exported function instead of CLI with complex process management |
| 18 | +- **Minimal metadata**: Only tracks essential metrics (diff, duration, cost, optional error) |
| 19 | + |
| 20 | +## Usage |
| 21 | + |
| 22 | +```typescript |
| 23 | +import { runGitEvals2 } from './evals/git-evals2/run-git-evals2' |
| 24 | + |
| 25 | +const results = await runGitEvals2({ |
| 26 | +  evalDataPath: 'evals/git-evals/eval-codebuff2.json', |
| 27 | +  agents: ['base', 'base-lite'], |
| 28 | +  outputPath: 'evals/git-evals2/results.json', |
| 29 | +  limit: 5, |
| 30 | +  onProgress: (event) => { |
| 31 | +    if (event.type === 'agent_complete') { |
| 32 | +      console.log(`${event.agent} completed with score ${event.score}`) |
| 33 | +    } |
| 34 | +  }, |
| 35 | +}) |
| 36 | + |
| 37 | +console.log('Average scores:', { |
| 38 | +  base: results.agents.get('base')?.averageScore, |
| 39 | +  'base-lite': results.agents.get('base-lite')?.averageScore, |
| 40 | +}) |
| 41 | +``` |
| 42 | + |
| 43 | +## API |
| 44 | + |
| 45 | +### `runGitEvals2(options: GitEvals2Options): Promise<GitEvals2Result>` |
| 46 | + |
| 47 | +#### Options |
| 48 | + |
| 49 | +- `evalDataPath` (string): Path to eval JSON file with commits |
| 50 | +- `agents` (string[]): Array of agent IDs to compare (e.g., `['base', 'base-lite']`) |
| 51 | +- `outputPath?` (string): Optional path to write results JSON |
| 52 | +- `limit?` (number): Optional max number of commits to evaluate |
| 53 | +- `onProgress?` (callback): Optional progress event handler |
| 54 | +- `client?` (CodebuffClient): Optional SDK client override (useful for testing) |
| 55 | + |
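
The option list above corresponds roughly to the type below. This is a sketch reconstructed from the list, not the definition in `types.ts`. Three things are assumptions: the `EvalProgressEvent` shape (taken from the single `agent_complete` event in the Usage example), the `unknown` type standing in for the SDK's `CodebuffClient`, and the hypothetical `applyLimit` helper, which shows one plausible reading of `limit`.

```typescript
// Assumed event shape: only `agent_complete` with `agent` and `score`
// appears in this README; the real event union may have more members.
type EvalProgressEvent = {
  type: 'agent_complete'
  agent: string
  score: number
}

// Sketch of the options object, field for field from the list above.
interface GitEvals2OptionsSketch {
  evalDataPath: string
  agents: string[]
  outputPath?: string
  limit?: number
  onProgress?: (event: EvalProgressEvent) => void
  client?: unknown // stands in for CodebuffClient, whose type is not shown here
}

// Hypothetical helper: treat `limit` as "take the first N commits".
function applyLimit<T>(commits: T[], limit?: number): T[] {
  return limit === undefined ? commits : commits.slice(0, limit)
}
```

With this reading, omitting `limit` evaluates every commit in the eval file.
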
| 56 | +#### Result |
| 57 | + |
| 58 | +```typescript |
| 59 | +interface GitEvals2Result { |
| 60 | +  agents: Map<string, AgentEvalResults> |
| 61 | +  timestamp: string |
| 62 | +  totalDuration: number |
| 63 | +} |
| 64 | + |
| 65 | +interface AgentEvalResults { |
| 66 | +  agentId: string |
| 67 | +  runs: EvalRun[] |
| 68 | +  averageScore: number |
| 69 | +  averageCost: number |
| 70 | +  averageDuration: number |
| 71 | +} |
| 72 | + |
| 73 | +interface EvalRun { |
| 74 | +  commitSha: string |
| 75 | +  spec: string |
| 76 | +  diff: string |
| 77 | +  judgeScore: number |
| 78 | +  judgeFeedback: string |
| 79 | +  cost: number |
| 80 | +  durationMs: number |
| 81 | +  error?: string |
| 82 | +} |
| 83 | +``` |
| 84 | + |
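
The README does not say how the `average*` fields are derived from `runs`. A minimal sketch, assuming they are plain arithmetic means over every run (failed runs included), could look like this. `mean` and `summarizeRuns` are hypothetical names, not functions from the module.

```typescript
interface EvalRun {
  commitSha: string
  spec: string
  diff: string
  judgeScore: number
  judgeFeedback: string
  cost: number
  durationMs: number
  error?: string
}

interface AgentEvalResults {
  agentId: string
  runs: EvalRun[]
  averageScore: number
  averageCost: number
  averageDuration: number
}

// Arithmetic mean; returns 0 for an empty list to avoid NaN.
function mean(values: number[]): number {
  if (values.length === 0) return 0
  return values.reduce((sum, v) => sum + v, 0) / values.length
}

// Assumption: averages are unweighted and include runs that recorded an error.
function summarizeRuns(agentId: string, runs: EvalRun[]): AgentEvalResults {
  return {
    agentId,
    runs,
    averageScore: mean(runs.map((r) => r.judgeScore)),
    averageCost: mean(runs.map((r) => r.cost)),
    averageDuration: mean(runs.map((r) => r.durationMs)),
  }
}
```

Whether failed runs should count toward the averages is a design choice. Excluding them would make an agent that errors often look better than it is.
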
| 85 | +## How It Differs |
| 86 | + |
| 87 | +### Architecture |
| 88 | + |
| 89 | +- **Original**: Forks a child process for each eval and coordinates them over IPC |
| 90 | +- **git-evals2**: Simple async functions with Promise.all for parallelism |
| 91 | + |
| 92 | +### Execution |
| 93 | + |
| 94 | +- **Original**: Multi-turn conversations with prompting agent deciding continue/complete/halt |
| 95 | +- **git-evals2**: Single-shot - agent gets spec and runs until done or timeout |
| 96 | + |
| 97 | +### Judging |
| 98 | + |
| 99 | +- **Original**: Judge sees spec + agent trace + final diff, 3 judges with median selection |
| 100 | +- **git-evals2**: Judge only sees spec + final diff (no trace), single judge call |
| 101 | + |
| 102 | +### State Management |
| 103 | + |
| 104 | +- **Original**: Complex SessionState threading, manual state updates |
| 105 | +- **git-evals2**: SDK handles state internally, minimal metadata tracking |
| 106 | + |
| 107 | +### Error Handling |
| 108 | + |
| 109 | +- **Original**: Process-level handlers, signal management, cleanup logic |
| 110 | +- **git-evals2**: Standard try-catch; a failed run is recorded in the results (the optional `error` field) and the remaining runs continue |
| 111 | + |
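
The Architecture and Error Handling points above combine into one pattern: run the agents concurrently with `Promise.all`, and wrap each run in try-catch so that one failure is recorded rather than rejecting the whole batch. The sketch below illustrates that pattern only. `RunOutcome`, `RunAgent`, and `runAgentsInParallel` are hypothetical names, and the real `agent-runner.ts` may be structured differently.

```typescript
interface RunOutcome {
  agentId: string
  diff: string
  durationMs: number
  error?: string
}

// Hypothetical signature for whatever executes one agent on one spec
// and resolves to the resulting diff.
type RunAgent = (agentId: string, spec: string) => Promise<string>

async function runAgentsInParallel(
  agents: string[],
  spec: string,
  runAgent: RunAgent,
): Promise<RunOutcome[]> {
  return Promise.all(
    agents.map(async (agentId): Promise<RunOutcome> => {
      const start = Date.now()
      try {
        const diff = await runAgent(agentId, spec)
        return { agentId, diff, durationMs: Date.now() - start }
      } catch (err) {
        // Record the failure instead of rethrowing, so Promise.all
        // still resolves with the outcomes of the other agents.
        return {
          agentId,
          diff: '',
          durationMs: Date.now() - start,
          error: err instanceof Error ? err.message : String(err),
        }
      }
    }),
  )
}
```

Because every mapped promise catches its own error, `Promise.all` never rejects here, and the outcomes come back in the same order as the `agents` array.
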
| 112 | +## Module Structure |
| 113 | + |
| 114 | +- `run-git-evals2.ts`: Main orchestration function |
| 115 | +- `agent-runner.ts`: Executes single agent on a commit |
| 116 | +- `judge.ts`: Judges file changes without trace |
| 117 | +- `types.ts`: Type definitions |
| 118 | +- `example.ts`: Example usage |
| 119 | + |
| 120 | +## Benefits |
| 121 | + |
| 122 | +- **Simpler codebase**: ~90% less code than the original system |
| 123 | +- **Faster execution**: Less overhead from process management |
| 124 | +- **Easier debugging**: Everything in-process with standard async/await |
| 125 | +- **More maintainable**: Clear separation of concerns, modular design |
| 126 | +- **Still powerful**: Maintains core evaluation functionality |