# Codebuff Evals

This directory contains the evaluation framework for testing and measuring Codebuff's coding capabilities, with a focus on the innovative **Git Commit Reimplementation Evaluation** system.

## Overview

The evaluation system takes a fundamentally different approach from traditional coding benchmarks like SWE Bench or Terminal Bench. Instead of asking agents to pass predefined tests, our evaluations challenge coding agents to reimplement real git commits from open source projects over multiple interactive steps.

### Core Idea: Commit Reconstruction Methodology

Our evaluation framework centers on having coding agents reconstruct actual git commits from open source repositories through an interactive, multi-turn process.

A specialized prompting agent guides the coding agent through up to 5 conversational rounds to implement a specification derived from the original commit's changes.

The process concludes with an AI judge that provides comprehensive scoring by comparing the agent's implementation against the ground truth commit.

This methodology enables nuanced evaluation across multiple dimensions: an agent might produce functionally correct code but receive lower scores for being unnecessarily verbose, failing to leverage existing helper functions, missing edge cases present in the original implementation, or taking an inefficient path with excessive revisions and mistakes.

## Architecture

### System Components

1. **Evaluation Orchestration** (`run-git-evals.ts`, `run-eval-set.ts`)

   - Manages the complete evaluation pipeline
   - Handles concurrency and process management
   - Coordinates between all system components

2. **Agent Runners** (`runners/`)

   - **Codebuff Runner**: Integrates with a local Codebuff installation
   - **Claude Runner**: Integrates with Anthropic's Claude Code
   - **Runner Interface**: Common abstraction for all coding agents

3. **Prompting Agent** (`prompting-agent.ts`)

   - Acts as the "human developer" in the loop
   - Analyzes conversation history and decides next actions
   - Generates follow-up prompts to guide the coding agent
   - Makes decisions: `continue`, `complete`, or `halt` (see the sketch after this list)

4. **Judging System** (`judge-git-eval.ts`)

   - Uses AI (Gemini 2.5 Pro) to score implementations
   - Compares agent output against ground truth git diffs
   - Provides detailed scoring across multiple dimensions
   - Runs 3 judges in parallel and takes the median for robustness

5. **Test Repository Management** (`setup-test-repo.ts`)

   - Clones and manages git repositories for testing
   - Handles commit checkout and environment setup
   - Provides isolated testing environments

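For reference, the prompting agent's per-turn decision can be thought of as a small discriminated union. This is a hypothetical sketch; the actual type lives in `prompting-agent.ts` / `types.ts` and may differ.

```typescript
// Hypothetical shape of the prompting agent's per-turn decision.
type PromptingDecision =
  | { action: 'continue'; nextPrompt: string } // keep going with a follow-up prompt
  | { action: 'complete' } // the spec appears fully implemented
  | { action: 'halt' } // the run is off-track; stop early
```
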
### Evaluation Workflow

```mermaid
sequenceDiagram
    participant Orchestrator as Eval Orchestrator
    participant PromptAgent as Prompting Agent
    participant CodingAgent as Coding Agent (Codebuff/Claude)
    participant Judge as AI Judge
    participant Repo as Test Repository

    Orchestrator->>Repo: Setup repo at commit^ (before target)
    Orchestrator->>PromptAgent: Start with spec

    loop Up to 5 attempts
        PromptAgent->>PromptAgent: Analyze conversation history
        PromptAgent->>CodingAgent: Send implementation prompt
        CodingAgent->>Repo: Make code changes via tools
        CodingAgent->>PromptAgent: Return conversation trace
        PromptAgent->>PromptAgent: Decide: continue/complete/halt
    end

    Orchestrator->>Judge: Compare output vs ground truth
    Judge->>Orchestrator: Return detailed scores & analysis
```

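The control flow above can be summarized in a few lines of TypeScript. This is a simplified, hypothetical sketch rather than the actual orchestrator code: it assumes the `Runner` and `AgentStep` types shown later under "Adding New Agents", the `PromptingDecision` sketch above, and a `decideNext` function standing in for the prompting agent.

```typescript
// Hypothetical sketch of the per-commit evaluation loop (up to 5 turns).
async function runOneEval(
  spec: string,
  runner: Runner,
  decideNext: (spec: string, trace: AgentStep[]) => Promise<PromptingDecision>,
  maxTurns = 5,
) {
  const trace: AgentStep[] = []
  let prompt = spec
  for (let turn = 0; turn < maxTurns; turn++) {
    const { steps } = await runner.run(prompt) // coding agent edits the repo via tools
    trace.push(...steps)
    const decision = await decideNext(spec, trace) // prompting agent reviews progress
    if (decision.action !== 'continue') break // complete or halt ends the loop early
    prompt = decision.nextPrompt
  }
  return trace // handed to the AI judge along with the ground truth diff
}
```
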
## Key Features

### Multi-Step Interactive Process

- **Up to 5 conversation turns** between the prompting agent and the coding agent
- **Adaptive prompting** based on conversation history and progress
- **Early termination** when the task is complete or has gone off-track

### Comprehensive Scoring

The AI judge evaluates four key dimensions:

- **Completion Score (0-10)**: How completely was the spec implemented compared to the ground truth?
- **Efficiency Score (0-10)**: How efficiently did the agent work, without unnecessary steps?
- **Code Quality Score (0-10)**: How well-structured and maintainable is the code?
- **Overall Score (0-10)**: Combined assessment of implementation quality

### Real-World Relevance

- Uses **actual commits from real open source projects**
- Tests on **diverse coding scenarios** and project types
- Evaluates **end-to-end coding capabilities**, including tool usage
## Directory Structure

```
evals/
├── git-evals/                 # Main git commit evaluation system
│   ├── run-git-evals.ts       # Core evaluation orchestrator
│   ├── run-single-eval.ts     # CLI for running individual evals
│   ├── run-eval-set.ts        # Batch evaluation runner
│   ├── judge-git-eval.ts      # AI judging system
│   ├── post-eval-analysis.ts  # Aggregate analysis of results
│   │
│   ├── runners/               # Agent integrations
│   │   ├── runner.ts          # Common runner interface
│   │   ├── codebuff.ts        # Codebuff agent runner
│   │   └── claude.ts          # Claude Code runner
│   │
│   ├── pick-commits.ts        # Intelligent commit selection
│   ├── gen-evals.ts           # Specification generation
│   ├── gen-repo-eval.ts       # End-to-end eval creation
│   ├── setup-test-repo.ts     # Repository management
│   ├── prompting-agent.ts     # Prompting agent logic
│   └── types.ts               # Type definitions
│
├── scaffolding.ts             # Test environment utilities
├── test-setup.ts              # Environment configuration
└── knowledge.md               # Additional documentation
```

## Usage

### Running Evaluations

#### Single Evaluation

```bash
# Run a specific commit evaluation
bun run evals/git-evals/run-single-eval.ts \
  --eval-file eval-codebuff.json \
  --commit-index 0 \
  --agent base2

# Run by commit SHA
bun run evals/git-evals/run-single-eval.ts \
  --eval-file eval-manifold.json \
  --commit-sha abc123 \
  --output results.json
```

#### Batch Evaluations

```bash
# Run full evaluation set
bun run evals/git-evals/run-eval-set.ts

# Run with specific configuration
bun run evals/git-evals/run-git-evals.ts \
  eval-codebuff.json \
  output-dir \
  codebuff
```

### Creating New Evaluations

#### 1. Pick Commits from Repository

```bash
# Analyze repository and select good evaluation commits
bun run evals/git-evals/pick-commits.ts \
  https://github.com/user/repo \
  ./picked-commits.json \
  300
```

#### 2. Generate Evaluation File

```bash
# Create complete evaluation from picked commits
bun run evals/git-evals/gen-repo-eval.ts \
  https://github.com/user/repo \
  ./picked-commits.json \
  ./eval-output.json
```

## Evaluation Data Format

### Evaluation File Structure

```typescript
interface EvalData {
  repoUrl: string // Source repository
  testRepoName?: string // Optional repo name override
  generationDate: string // When eval was created
  initCommand?: string // Optional setup command
  evalCommits: EvalCommit[] // List of evaluation tasks
}

interface EvalCommit {
  sha: string // Target commit SHA
  spec: string // Natural language specification
  fileStates: FileState[] // Ground truth file changes
}
```

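For illustration, an eval file is this structure serialized to JSON. The values below are made up; only the field names come from the interfaces above.

```typescript
// Illustrative values only; not a real eval entry.
const example: EvalData = {
  repoUrl: 'https://github.com/user/repo',
  generationDate: '2025-01-01',
  initCommand: 'bun install',
  evalCommits: [
    {
      sha: 'abc123',
      spec: 'Add a --verbose flag that logs per-file progress during export.',
      fileStates: [], // ground truth file contents captured from the real commit
    },
  ],
}
```
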
### Results Format

```typescript
interface EvalRunJudged {
  eval_commit: EvalCommit // Original evaluation task
  trace: CodebuffTrace[] // Conversation history
  error?: string // Any execution errors
  gitDiff: string // Agent's actual changes
  durationMs: number // Execution time
  costUsd: number // API costs incurred
  judging_results: {
    // AI judge analysis
    analysis: string
    strengths: string[]
    weaknesses: string[]
    metrics: {
      completionScore: number // 0-10
      efficiencyScore: number // 0-10
      codeQualityScore: number // 0-10
      overallScore: number // 0-10
    }
  }
}
```

## Supported Coding Agents

### Codebuff Integration

- Uses the Codebuff SDK for local integration
- Supports custom agent types (base, base2, base-lite, etc.)

### Claude Code Integration

- Integrates with Anthropic's Claude Code API
- Supports bypass permissions for automated testing

### Adding New Agents

Implement the `Runner` interface in `runners/`:

```typescript
export type Runner = {
  run: (prompt: string) => Promise<{
    steps: AgentStep[]
    totalCostUsd: number
  }>
}
```

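A minimal runner might look like the following. This is a hypothetical example (a do-nothing runner useful for smoke-testing the harness), assuming `Runner` and `AgentStep` are exported from `runners/runner.ts`.

```typescript
// Hypothetical no-op runner: returns no steps and costs nothing.
import type { AgentStep, Runner } from './runner'

export function createNoopRunner(): Runner {
  return {
    run: async (prompt: string) => {
      // A real runner would drive a coding agent here and collect its steps.
      console.log(`noop runner received prompt: ${prompt.slice(0, 80)}`)
      const steps: AgentStep[] = []
      return { steps, totalCostUsd: 0 }
    },
  }
}
```
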
## Advanced Features

### Intelligent Commit Selection

The `pick-commits.ts` system uses AI to select high-quality evaluation commits that include substantial, self-contained changes.

### Judging

- **Comprehensive analysis** including strengths, weaknesses, and specific metrics
- **Cost tracking** and performance monitoring
- **Token management** with intelligent truncation for large contexts
- **Multiple judges** (3 parallel judges with median selection; see the sketch below)

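One way the median selection could work is per metric, as sketched below. This is purely illustrative; whether `judge-git-eval.ts` takes the median per metric or picks a single median judge is an implementation detail not documented here.

```typescript
// Illustrative: combine three judge results by taking the per-metric median.
type JudgeMetrics = {
  completionScore: number
  efficiencyScore: number
  codeQualityScore: number
  overallScore: number
}

const median = (xs: number[]): number =>
  [...xs].sort((a, b) => a - b)[Math.floor(xs.length / 2)]

function combineJudges(judges: JudgeMetrics[]): JudgeMetrics {
  const pick = (key: keyof JudgeMetrics) => median(judges.map((j) => j[key]))
  return {
    completionScore: pick('completionScore'),
    efficiencyScore: pick('efficiencyScore'),
    codeQualityScore: pick('codeQualityScore'),
    overallScore: pick('overallScore'),
  }
}
```
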
### Post-Evaluation Analysis

The `post-eval-analysis.ts` system provides:

- **Aggregate performance metrics** across all evaluation runs (see the sketch below)
- **Problem identification** with severity and frequency analysis
- **Development recommendations** for improving agent performance
- **Trend analysis** and systematic issue detection

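The simplest of these aggregates can be computed directly from the judged results. A rough sketch, assuming an array of `EvalRunJudged` objects as defined in the results format above:

```typescript
// Rough sketch of aggregate metrics over a set of judged runs.
function summarize(runs: EvalRunJudged[]) {
  const avg = (xs: number[]) =>
    xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0
  const metrics = runs.map((r) => r.judging_results.metrics)
  return {
    runs: runs.length,
    failures: runs.filter((r) => r.error).length,
    avgOverallScore: avg(metrics.map((m) => m.overallScore)),
    avgCompletionScore: avg(metrics.map((m) => m.completionScore)),
    totalCostUsd: runs.reduce((sum, r) => sum + r.costUsd, 0),
  }
}
```
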
## Configuration

### Test Environment

- Evaluations run in isolated git repositories
- Each test gets a fresh clone at the target commit's parent (sketched below)
- File system mocking for safe tool execution
- Process isolation with proper cleanup

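In plain git terms, the setup step amounts to something like the following. This is a minimal sketch; the real `setup-test-repo.ts` also handles things like the optional init command and cleanup.

```typescript
// Minimal sketch: clone the repo and check out the parent of the target commit.
import { execSync } from 'node:child_process'

function setupTestRepo(repoUrl: string, sha: string, dir: string): void {
  execSync(`git clone ${repoUrl} ${dir}`, { stdio: 'inherit' })
  // `<sha>^` is the target commit's parent, i.e. the state "before" the change.
  execSync(`git checkout ${sha}^`, { cwd: dir, stdio: 'inherit' })
}
```
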
## Best Practices

### Creating Quality Evaluations

1. **Select diverse commits** representing different types of changes
2. **Ensure clear specifications** that describe observable behavior
3. **Test specifications manually** to verify implementability
4. **Balance complexity** - not too simple, not overwhelming

## Examples

The `evals/git-evals/` directory contains several example evaluation files:

- `eval-codebuff.json` - Codebuff project evaluations
- `eval-manifold.json` - Manifold prediction market evaluations
- `eval-saleor.json` - Saleor e-commerce platform evaluations

These demonstrate the evaluation format and provide ready-to-use test cases.