Commit 03b5574

Initial readme for evals

1 parent 02b8e7a commit 03b5574

evals/README.md

Lines changed: 303 additions & 0 deletions
# Codebuff Evals

This directory contains the evaluation framework for testing and measuring Codebuff's coding capabilities, with a focus on the innovative **Git Commit Reimplementation Evaluation** system.

## Overview

The evaluation system takes a fundamentally different approach from traditional coding benchmarks like SWE-bench or Terminal-Bench. Instead of asking agents to pass predefined tests, our evaluations challenge coding agents to reimplement real git commits from open source projects over multiple interactive steps.

### Core Idea: Commit Reconstruction Methodology

Our evaluation framework centers on having coding agents reconstruct actual git commits from open source repositories through an interactive, multi-turn process.

A specialized prompting agent guides the coding agent through up to 5 conversational rounds to implement a specification derived from the original commit's changes.

The process concludes with an AI judge that provides comprehensive scoring by comparing the agent's implementation against the ground truth commit.

This methodology enables nuanced evaluation across multiple dimensions: an agent might produce functionally correct code but receive lower scores for being unnecessarily verbose, failing to leverage existing helper functions, missing edge cases present in the original implementation, or taking an inefficient path with excessive revisions and mistakes.
## Architecture

### System Components

1. **Evaluation Orchestration** (`run-git-evals.ts`, `run-eval-set.ts`)

   - Manages the complete evaluation pipeline
   - Handles concurrency and process management
   - Coordinates between all system components

2. **Agent Runners** (`runners/`)

   - **Codebuff Runner**: Integrates with local Codebuff installation
   - **Claude Runner**: Integrates with Anthropic's Claude Code
   - **Runner Interface**: Common abstraction for all coding agents

3. **Prompting Agent** (`prompting-agent.ts`)

   - Acts as the "human developer" in the loop
   - Analyzes conversation history and decides next actions
   - Generates follow-up prompts to guide the coding agent
   - Makes decisions: `continue`, `complete`, or `halt` (see the sketch after this list)

4. **Judging System** (`judge-git-eval.ts`)

   - Uses AI (Gemini 2.5 Pro) to score implementations
   - Compares agent output against ground truth git diffs
   - Provides detailed scoring across multiple dimensions
   - Runs 3 judges in parallel and takes the median for robustness

5. **Test Repository Management** (`setup-test-repo.ts`)

   - Clones and manages git repositories for testing
   - Handles commit checkout and environment setup
   - Provides isolated testing environments

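
A minimal sketch of the decision the prompting agent returns each turn; the names below are illustrative, not the actual definitions in `types.ts`:

```typescript
// Hypothetical shape of the prompting agent's per-turn output.
// Field names are illustrative; see prompting-agent.ts and types.ts for the real definitions.
type PromptingDecision =
  | { action: 'continue'; followUpPrompt: string } // keep iterating with a new prompt
  | { action: 'complete' }                         // spec appears fully implemented
  | { action: 'halt' }                             // agent is off-track; stop early
```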
### Evaluation Workflow

```mermaid
sequenceDiagram
    participant Orchestrator as Eval Orchestrator
    participant PromptAgent as Prompting Agent
    participant CodingAgent as Coding Agent (Codebuff/Claude)
    participant Judge as AI Judge
    participant Repo as Test Repository

    Orchestrator->>Repo: Setup repo at commit^ (before target)
    Orchestrator->>PromptAgent: Start with spec

    loop Up to 5 attempts
        PromptAgent->>PromptAgent: Analyze conversation history
        PromptAgent->>CodingAgent: Send implementation prompt
        CodingAgent->>Repo: Make code changes via tools
        CodingAgent->>PromptAgent: Return conversation trace
        PromptAgent->>PromptAgent: Decide: continue/complete/halt
    end

    Orchestrator->>Judge: Compare output vs ground truth
    Judge->>Orchestrator: Return detailed scores & analysis
```
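
In code, the loop above looks roughly like the following sketch. The declared helpers (`runPromptingAgent`, `judgeEval`) and the runner shape are placeholders standing in for the real logic in `prompting-agent.ts`, `judge-git-eval.ts`, and `runners/`; this is not the actual implementation.

```typescript
// Simplified sketch of the orchestration loop; illustrative only.
declare function runPromptingAgent(
  spec: string,
  trace: unknown[],
): Promise<{ action: 'continue' | 'complete' | 'halt'; followUpPrompt?: string }>
declare function judgeEval(spec: string, trace: unknown[]): Promise<unknown>

async function runOneEval(
  spec: string,
  runner: { run: (prompt: string) => Promise<{ steps: unknown[] }> },
) {
  const trace: unknown[] = []
  let prompt = spec

  for (let turn = 0; turn < 5; turn++) {
    // The coding agent edits the test repository via its tools.
    const { steps } = await runner.run(prompt)
    trace.push(...steps)

    // The prompting agent reads the whole trace and decides what to do next.
    const decision = await runPromptingAgent(spec, trace)
    if (decision.action !== 'continue') break // finished, or hopelessly off-track
    prompt = decision.followUpPrompt ?? spec
  }

  // Finally, the AI judge compares the resulting diff against the ground truth commit.
  return judgeEval(spec, trace)
}
```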
## Key Features

### Multi-Step Interactive Process

- **Up to 5 conversation turns** between prompting agent and coding agent
- **Adaptive prompting** based on conversation history and progress
- **Early termination** when task is complete or off-track

### Comprehensive Scoring

The AI judge evaluates four key dimensions:

- **Completion Score (0-10)**: How completely was the spec implemented compared to ground truth?
- **Efficiency Score (0-10)**: How efficiently did the agent work without unnecessary steps?
- **Code Quality Score (0-10)**: How well-structured and maintainable is the code?
- **Overall Score (0-10)**: Combined assessment of implementation quality

### Real-World Relevance

- Uses **actual commits from real open source projects**
- Tests on **diverse coding scenarios** and project types
- Evaluates **end-to-end coding capabilities** including tool usage
## Directory Structure

```
evals/
├── git-evals/                  # Main git commit evaluation system
│   ├── run-git-evals.ts        # Core evaluation orchestrator
│   ├── run-single-eval.ts      # CLI for running individual evals
│   ├── run-eval-set.ts         # Batch evaluation runner
│   ├── judge-git-eval.ts       # AI judging system
│   ├── post-eval-analysis.ts   # Aggregate analysis of results
│   │
│   ├── runners/                # Agent integrations
│   │   ├── runner.ts           # Common runner interface
│   │   ├── codebuff.ts         # Codebuff agent runner
│   │   └── claude.ts           # Claude Code runner
│   │
│   ├── pick-commits.ts         # Intelligent commit selection
│   ├── gen-evals.ts            # Specification generation
│   ├── gen-repo-eval.ts        # End-to-end eval creation
│   ├── setup-test-repo.ts      # Repository management
│   ├── prompting-agent.ts      # Prompting agent logic
│   └── types.ts                # Type definitions
│
├── scaffolding.ts              # Test environment utilities
├── test-setup.ts               # Environment configuration
└── knowledge.md                # Additional documentation
```
## Usage

### Running Evaluations

#### Single Evaluation

```bash
# Run a specific commit evaluation
bun run evals/git-evals/run-single-eval.ts \
  --eval-file eval-codebuff.json \
  --commit-index 0 \
  --agent base2

# Run by commit SHA
bun run evals/git-evals/run-single-eval.ts \
  --eval-file eval-manifold.json \
  --commit-sha abc123 \
  --output results.json
```

#### Batch Evaluations

```bash
# Run full evaluation set
bun run evals/git-evals/run-eval-set.ts

# Run with specific configuration
bun run evals/git-evals/run-git-evals.ts \
  eval-codebuff.json \
  output-dir \
  codebuff
```

### Creating New Evaluations

#### 1. Pick Commits from Repository

```bash
# Analyze repository and select good evaluation commits
bun run evals/git-evals/pick-commits.ts \
  https://github.com/user/repo \
  ./picked-commits.json \
  300
```

#### 2. Generate Evaluation File

```bash
# Create complete evaluation from picked commits
bun run evals/git-evals/gen-repo-eval.ts \
  https://github.com/user/repo \
  ./picked-commits.json \
  ./eval-output.json
```
## Evaluation Data Format

### Evaluation File Structure

```typescript
interface EvalData {
  repoUrl: string            // Source repository
  testRepoName?: string      // Optional repo name override
  generationDate: string     // When eval was created
  initCommand?: string       // Optional setup command
  evalCommits: EvalCommit[]  // List of evaluation tasks
}

interface EvalCommit {
  sha: string                // Target commit SHA
  spec: string               // Natural language specification
  fileStates: FileState[]    // Ground truth file changes
}
```
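
For illustration, a single-commit eval file using the interfaces above would look roughly like this; the repository URL, SHA, and spec text are made up:

```typescript
// Hypothetical example of an eval file's contents; values are illustrative only.
const exampleEval: EvalData = {
  repoUrl: 'https://github.com/user/repo',
  generationDate: '2025-01-01T00:00:00.000Z',
  initCommand: 'bun install',
  evalCommits: [
    {
      sha: 'abc123',
      spec: 'Add retry logic with exponential backoff to the HTTP client.',
      fileStates: [], // ground truth file changes (filled in when the eval is generated)
    },
  ],
}
```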
### Results Format

```typescript
interface EvalRunJudged {
  eval_commit: EvalCommit  // Original evaluation task
  trace: CodebuffTrace[]   // Conversation history
  error?: string           // Any execution errors
  gitDiff: string          // Agent's actual changes
  durationMs: number       // Execution time
  costUsd: number          // API costs incurred
  judging_results: {
    // AI judge analysis
    analysis: string
    strengths: string[]
    weaknesses: string[]
    metrics: {
      completionScore: number   // 0-10
      efficiencyScore: number   // 0-10
      codeQualityScore: number  // 0-10
      overallScore: number      // 0-10
    }
  }
}
```
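
As a quick way to inspect a batch of results, something like the sketch below works. It assumes the output directory contains one JSON file per judged run, which may differ from the layout `run-git-evals.ts` actually produces; adapt the paths accordingly.

```typescript
// Hedged sketch: summarize overall scores from a directory of result JSON files.
// The file layout is an assumption, not the documented output format.
import { readdirSync, readFileSync } from 'node:fs'
import { join } from 'node:path'

function summarize(resultsDir: string) {
  const scores = readdirSync(resultsDir)
    .filter((f) => f.endsWith('.json'))
    .map((f) => JSON.parse(readFileSync(join(resultsDir, f), 'utf8')))
    .map((run) => run.judging_results.metrics.overallScore as number)

  const mean = scores.reduce((a, b) => a + b, 0) / scores.length
  console.log(`${scores.length} runs, mean overall score ${mean.toFixed(2)} / 10`)
}

summarize('output-dir')
```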
## Supported Coding Agents

### Codebuff Integration

- Uses the Codebuff SDK for local integration
- Supports custom agent types (base, base2, base-lite, etc.)

### Claude Code Integration

- Integrates with Anthropic's Claude Code API
- Supports bypass permissions for automated testing

### Adding New Agents

Implement the `Runner` interface in `runners/`:

```typescript
export type Runner = {
  run: (prompt: string) => Promise<{
    steps: AgentStep[]
    totalCostUsd: number
  }>
}
```
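
A new runner only needs to translate a prompt into agent steps and report cost. For instance, a minimal, hypothetical runner that shells out to an imaginary `my-agent` CLI might look like the sketch below; the import path and the `AgentStep` mapping are assumptions, so check `runners/runner.ts` for the real types before copying it.

```typescript
// Illustrative only: a runner wrapping a hypothetical `my-agent` CLI.
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'
import type { Runner, AgentStep } from './runner' // assumed import path

const execFileAsync = promisify(execFile)

export const myAgentRunner: Runner = {
  run: async (prompt: string) => {
    // Invoke the external agent and capture what it produced.
    const { stdout } = await execFileAsync('my-agent', ['--prompt', prompt])
    return {
      // Map the CLI output into whatever AgentStep actually looks like.
      steps: [{ response: stdout }] as unknown as AgentStep[],
      totalCostUsd: 0, // fill in real cost tracking if the agent reports it
    }
  },
}
```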
## Advanced Features

### Intelligent Commit Selection

The `pick-commits.ts` system uses AI to select high-quality evaluation commits that include substantial, self-contained changes.

### Judging

- **Comprehensive analysis** including strengths, weaknesses, and specific metrics
- **Cost tracking** and performance monitoring
- **Token management** with intelligent truncation for large contexts
- **Multiple judges** (3 parallel judges with median selection; see the sketch below)

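
Median selection is a simple robustness trick: each of the three judges scores independently, and the middle value is kept so a single outlier judgment cannot swing the result. A sketch of the aggregation (the real logic lives in `judge-git-eval.ts`):

```typescript
// Sketch of median aggregation over parallel judge scores; illustrative only.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b)
  return sorted[Math.floor(sorted.length / 2)] // middle element (fine for an odd judge count)
}

// e.g. overall scores from three judges: one outlier of 2 does not drag the result down.
median([8, 9, 2]) // => 8
```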
### Post-Evaluation Analysis

The `post-eval-analysis.ts` system provides:

- **Aggregate performance metrics** across all evaluation runs
- **Problem identification** with severity and frequency analysis
- **Development recommendations** for improving agent performance
- **Trend analysis** and systematic issue detection

## Configuration

### Test Environment

- Evaluations run in isolated git repositories
- Each test gets a fresh clone at the target commit's parent (see the sketch below)
- File system mocking for safe tool execution
- Process isolation with proper cleanup

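
Concretely, preparing a test repository amounts to cloning into an isolated directory and checking out the parent of the target commit, roughly as below. This is a hedged sketch; `setup-test-repo.ts` handles the real setup, including the optional `initCommand`.

```typescript
// Minimal sketch of preparing an isolated repo at the commit's parent; illustrative only.
import { execSync } from 'node:child_process'
import { mkdtempSync } from 'node:fs'
import { tmpdir } from 'node:os'
import { join } from 'node:path'

function setupTestRepo(repoUrl: string, targetSha: string): string {
  const dir = mkdtempSync(join(tmpdir(), 'eval-repo-'))
  execSync(`git clone ${repoUrl} .`, { cwd: dir, stdio: 'ignore' })
  // Check out the state *before* the target commit, so the agent has to reimplement it.
  execSync(`git checkout ${targetSha}^`, { cwd: dir, stdio: 'ignore' })
  return dir
}
```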
## Best Practices

### Creating Quality Evaluations

1. **Select diverse commits** representing different types of changes
2. **Ensure clear specifications** that describe observable behavior
3. **Test specifications manually** to verify implementability
4. **Balance complexity** - not too simple, not overwhelming

## Examples

The `evals/git-evals/` directory contains several example evaluation files:

- `eval-codebuff.json` - Codebuff project evaluations
- `eval-manifold.json` - Manifold prediction market evaluations
- `eval-saleor.json` - Saleor e-commerce platform evaluations

These demonstrate the evaluation format and provide ready-to-use test cases.
