[buffbench] base-layer with iterative planner; spec all at once

jahooma · jahooma · commit 1cce0bc36a86 · 2025-10-08T23:51:32.000-07:00
diff --git a/.agents/base2/base-layer.ts b/.agents/base2/base-layer.ts
@@ -37,8 +37,8 @@ const definition: SecretAgentDefinition = {
     'read-only-commander',
     'decomposing-thinker',
     'code-sketcher',
+    'iterative-planner',
     'editor',
-    'decomposing-reviewer',
     'reviewer',
     'context-pruner',
   ],
@@ -83,25 +83,28 @@ The user asks you to implement a new feature. You respond in multiple steps:
 1a. Read all the relevant files using the read_files tool.
 2. Spawn one more file explorer and one more find-all-referencer with different prompts to find relevant files; spawn a decomposing thinker with questions on a key decision; spawn a decomposing thinker to plan out the feature part-by-part. Spawn a code sketcher to sketch out one key section of the code that is the most important or difficult.
 2a. Read all the relevant files using the read_files tool.
-3. Spawn a decomposing thinker to answer final design and implementation questions and critique the code sketch that was produced. Spawn one more code sketcher to sketch another key section.
+3. Spawn an iterative-planner with a step-by-step initial plan. Spawn one more code sketcher to sketch another key section.
 4. Spawn two editors to implement all the changes.
 5. Spawn a reviewer to review the changes made by the editors.
 
 
-## Guidelines
+## Spawning agents guidelines
 
 - **Sequence agents properly:** Keep in mind dependencies when spawning different agents:
   - Spawn file explorers, find-all-referencer, and researchers before thinkers because then the thinkers can use the file/research results to come up with better conclusions
   - Spawn thinkers before editors so editors can use the insights from the thinkers.
   - Reviewers should be spawned after editors.
-- **Use the decomposing thinker also to check what context you are missing:** Ask what context you don't have for specific subtasks that you should could still acquire (with file pickers or find-all-referencers or researchers or using the read_files tool). Getting more context is one of the most important things you should do before editing or coding anything.
-- **Spawn editors later** Only spawn editors after gathering all the context.
-- **Stop and ask for guidance:** You should feel free to stop and ask the user for guidance if you're stuck or don't know what to try next, or need a clarification.
+- **Use the decomposing thinker also to check what context you are missing:** Ask what context you don't have for specific subtasks that you could still acquire (with file pickers or find-all-referencers or researchers or using the read_files tool). Getting more context is one of the most important things you should do before planning or editing or coding anything.
+- **Once you've gathered all the context you need, create a plan:** Spawn an iterative-planner with a step-by-step initial plan, or if it's not a complex task simply write out your plan as a bullet point list.
+- **Spawn editors later** Only spawn editors after gathering all the context and creating a plan.
 - **No need to include context:** When prompting an agent, realize that many agents can already see the entire conversation history, so you can be brief in prompting them without needing to include context.
+
+## General guidelines
+- **Stop and ask for guidance:** You should feel free to stop and ask the user for guidance if you're stuck or don't know what to try next, or need a clarification.
 - **Be careful about terminal commands:** Be careful about instructing subagents to run terminal commands that could be destructive or have effects that are hard to undo (e.g. git push, running scripts that could alter production environments, installing packages globally, etc). Don't do any of these unless the user explicitly asks you to.
 `,
 
-  stepPrompt: `Don't forget to spawn agents that could help, especially: the file-explorer and find-all-referencer to get codebase context, the decomposing thinker to think about key decisions, the code sketcher to sketch out the key sections of code, and the reviewer/decomposing-reviewer to review code changes made by the editor(s).`,
+  stepPrompt: `Don't forget to spawn agents that could help, especially: the file-explorer and find-all-referencer to get codebase context, the decomposing thinker to think about key decisions, the code sketcher to sketch out the key sections of code, the iterative-planner to create a plan, and the reviewer/decomposing-reviewer to review code changes made by the editor(s).`,
 
   handleSteps: function* ({ prompt, params }) {
     let steps = 0
diff --git a/.agents/planners/iterative-planner.ts b/.agents/planners/iterative-planner.ts
@@ -0,0 +1,66 @@
+import { publisher } from '../constants'
+import type { SecretAgentDefinition } from '../types/secret-agent-definition'
+
+const definition: SecretAgentDefinition = {
+  id: 'iterative-planner',
+  publisher,
+  model: 'anthropic/claude-sonnet-4.5',
+  displayName: 'Iterative Planner',
+  spawnerPrompt:
+    'Spawn this agent when you need to create a detailed implementation plan through iterative refinement with critique and validation steps. Spawn it with a rough step-by-step initial plan.',
+  inputSchema: {
+    prompt: {
+      type: 'string',
+      description: 'The initial step-by-step plan to refine and validate',
+    },
+  },
+  includeMessageHistory: true,
+  inheritParentSystemPrompt: true,
+  outputMode: 'last_message',
+  toolNames: ['spawn_agents'],
+  spawnableAgents: ['plan-critiquer'],
+
+  instructionsPrompt: `You are an expert implementation planner. Your job is to:
+- Take an initial high-level plan and add key implementation details. Include important decisions and alternatives. Identify key interfaces and contracts between components and key pieces of code. Add validation steps to ensure correctness. Identify which steps can be done in parallel.
+- Spawn a plan-critiquer agent with the entire revised, fleshed out plan.
+- Incorporate feedback from the critiques to output a final plan.
+  
+Instructions:
+
+1. Immediately spawn the plan-critiquer agent with an updated plan:
+
+Transform the initial plan into a detailed implementation guide that includes:
+
+**All User Requirements:**
+- Make sure the plan addresses all the requirements in the user's request, and does not do other stuff that the user did not ask for.
+
+**Key Decisions & Trade-offs:**
+- Architecture decisions and rationale
+- Cruxes of the plan
+- Alternatives considered
+
+**Interfaces & Contracts:**
+- Clear API signatures between components
+- Key tricky bits of code (keep this short though)
+
+**Validation Steps:**
+- How to verify each step works correctly
+- Include explicit verification steps when it makes sense in the plan.
+
+**Dependencies & Parallelism:**
+- Identify which steps depend on each other and which can be done in parallel.
+
+Feel free to completely change the initial plan if you think of something better.
+
+2. After receiving the critique, revise the plan to address all concerns while maintaining simplicity and clarity. Output the final plan.
+
+## Guidelines for the plan
+
+- IMPORTANT: Don't overengineer the plan -- prefer minimalism and simplicity in almost every case. Streamline the final plan to be as minimal as possible.
+- IMPORTANT: You must pay attention to the user's request! Make sure to address all the requirements in the user's request, and nothing more.
+- Reuse existing code whenever possible -- you may need to seek out helpers from other parts of the codebase.
+- Use existing patterns and conventions from the codebase. Keep naming consistent. It's good to read other files that could have relevant patterns and examples to understand the conventions.
+- Try not to modify more files than necessary.`,
+}
+
+export default definition
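
Note on the "Dependencies & Parallelism" section of the prompt above: the planner is asked to identify which steps depend on each other and which can run in parallel. One way to picture that is as grouping steps into waves, where each wave only depends on earlier waves. The sketch below is illustrative only. `PlanStep` and `groupIntoParallelWaves` are hypothetical names that do not exist in this commit, and the agent itself produces a free-form text plan rather than this structure.

```typescript
// Hypothetical plan-step shape for illustration; not part of this commit.
interface PlanStep {
  id: string
  description: string
  dependsOn: string[]
}

// Groups steps into "waves": every step in a wave has all of its
// dependencies satisfied by earlier waves, so steps within the same
// wave could be carried out in parallel.
function groupIntoParallelWaves(steps: PlanStep[]): string[][] {
  const remaining = new Map<string, PlanStep>()
  for (const step of steps) remaining.set(step.id, step)

  const done = new Set<string>()
  const waves: string[][] = []

  while (remaining.size > 0) {
    // A step is ready once every dependency has already been completed.
    const ready = Array.from(remaining.values())
      .filter((s) => s.dependsOn.every((d) => done.has(d)))
      .map((s) => s.id)

    // No progress means a dependency cycle or a reference to an unknown step.
    if (ready.length === 0) {
      throw new Error('Cycle or unknown dependency in plan')
    }

    for (const id of ready) {
      remaining.delete(id)
      done.add(id)
    }
    waves.push(ready)
  }

  return waves
}

const waves = groupIntoParallelWaves([
  { id: 'a', description: 'Add agent definition', dependsOn: [] },
  { id: 'b', description: 'Update base-layer prompt', dependsOn: [] },
  { id: 'c', description: 'Wire agent into spawnableAgents', dependsOn: ['a', 'b'] },
  { id: 'd', description: 'Run evals', dependsOn: ['c'] },
])
```

In this example, steps `a` and `b` land in the first wave, `c` in the second, and `d` in the third.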
diff --git a/.agents/planners/plan-critiquer.ts b/.agents/planners/plan-critiquer.ts
@@ -0,0 +1,89 @@
+import { publisher } from '../constants'
+import type { SecretAgentDefinition } from '../types/secret-agent-definition'
+import type { ToolMessage } from '../types/util-types'
+
+const definition: SecretAgentDefinition = {
+  id: 'plan-critiquer',
+  publisher,
+  model: 'anthropic/claude-sonnet-4.5',
+  displayName: 'Plan Critiquer',
+  spawnerPrompt:
+    'Analyzes implementation plans to identify areas of concern and proposes solutions through parallel thinking.',
+  inputSchema: {
+    prompt: {
+      type: 'string',
+      description:
+        "The implementation plan to critique. Give a step-by-step breakdown of what you will do to fulfill the user's request.",
+    },
+  },
+  includeMessageHistory: true,
+  inheritParentSystemPrompt: true,
+  outputMode: 'structured_output',
+  outputSchema: {
+    type: 'object',
+    properties: {
+      critique: {
+        type: 'string',
+        description: 'Analysis of the plan with identified areas of concern',
+      },
+      suggestions: {
+        type: 'array',
+        items: {
+          type: 'object',
+        },
+        description: 'Suggestions for each area of concern',
+      },
+    },
+    required: ['critique', 'suggestions'],
+  },
+  toolNames: ['spawn_agents', 'set_output'],
+  spawnableAgents: ['decomposing-thinker'],
+
+  instructionsPrompt: `You are an expert plan reviewer. Your job is to:
+1. Analyze the implementation plan for potential issues and better alternatives.
+2. Identify 2-5 specific areas of concern that need deeper analysis
+3. Spawn a decomposing-thinker agent with the concerns as prompts. For each concern, formulate it as a specific question that can be answered by the thinker agent.
+
+## Guidelines for the critique
+
+IMPORTANT: You must pay attention to the user's request! Make sure to address all the requirements in the user's request, and nothing more.
+
+For the plan:
+- Focus on implementing the simplest solution that will accomplish the task in a high quality manner.
+- Reuse existing code whenever possible -- you may need to seek out helpers from other parts of the codebase.
+- Use existing patterns and conventions from the codebase. Keep naming consistent. It's good to read other files that could have relevant patterns and examples to understand the conventions.
+- Try not to modify more files than necessary.
+`,
+
+  handleSteps: function* () {
+    const { agentState } = yield 'STEP'
+
+    const lastAssistantMessage = agentState.messageHistory
+      .filter((m) => m.role === 'assistant')
+      .pop()
+
+    const critique =
+      typeof lastAssistantMessage?.content === 'string'
+        ? lastAssistantMessage.content
+        : ''
+    const toolResult = agentState.messageHistory
+      .filter((m) => m.role === 'tool' && m.content.toolName === 'spawn_agents')
+      .pop() as ToolMessage
+
+    const suggestions = toolResult
+      ? toolResult.content.output.map((result) =>
+          result.type === 'json' ? result.value : {},
+        )[0]
+      : []
+
+    yield {
+      toolName: 'set_output',
+      input: {
+        critique,
+        suggestions,
+      },
+    }
+  },
+}
+
+export default definition
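
Note on `handleSteps` in plan-critiquer.ts above: after one model step, it takes the last assistant message as the critique and the first result of the last `spawn_agents` tool call as the suggestions. The standalone sketch below restates that extraction so it can be exercised in isolation. The message types here are simplified assumptions, not the real definitions in `../types/util-types`, and `extractCritiqueOutput` is a hypothetical helper name. Unlike the original, the sketch also falls back to `[]` when the tool call returned an empty output list.

```typescript
// Simplified, assumed message shapes; the real types may differ.
type JsonResult = { type: 'json'; value: unknown }
type OtherResult = { type: 'text'; value: string }
type ToolMessage = {
  role: 'tool'
  content: { toolName: string; output: Array<JsonResult | OtherResult> }
}
type TextMessage = { role: 'assistant' | 'user'; content: string | unknown[] }
type Message = ToolMessage | TextMessage

// Type guard for tool results produced by the spawn_agents tool.
function isSpawnAgentsResult(m: Message): m is ToolMessage {
  return m.role === 'tool' && m.content.toolName === 'spawn_agents'
}

// Hypothetical helper mirroring the extraction done in handleSteps.
function extractCritiqueOutput(messageHistory: Message[]): {
  critique: string
  suggestions: unknown
} {
  // Last assistant message; only plain-string content counts as the critique.
  const lastAssistant = messageHistory
    .filter((m): m is TextMessage => m.role === 'assistant')
    .pop()
  const critique =
    typeof lastAssistant?.content === 'string' ? lastAssistant.content : ''

  // Last spawn_agents tool result; only its first output entry is used.
  const toolResult = messageHistory.filter(isSpawnAgentsResult).pop()
  const first: JsonResult | OtherResult | undefined =
    toolResult?.content.output[0]

  // Falls back to [] when there is no tool result or no output entry,
  // and to {} when the first entry is not JSON.
  const suggestions =
    first === undefined ? [] : first.type === 'json' ? first.value : {}

  return { critique, suggestions }
}
```

Because only the first output entry is read, `suggestions` holds whatever the single spawned decomposing-thinker returned, which is not necessarily the array that `outputSchema` declares.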
diff --git a/evals/git-evals/run-eval-set.ts b/evals/git-evals/run-eval-set.ts
@@ -134,21 +134,21 @@ async function runEvalSet(options: {
       evalDataPath: path.join(__dirname, 'eval-codebuff2.json'),
       outputDir,
     },
-    {
-      name: 'manifold',
-      evalDataPath: path.join(__dirname, 'eval-manifold2.json'),
-      outputDir,
-    },
-    {
-      name: 'plane',
-      evalDataPath: path.join(__dirname, 'eval-plane.json'),
-      outputDir,
-    },
-    {
-      name: 'saleor',
-      evalDataPath: path.join(__dirname, 'eval-saleor.json'),
-      outputDir,
-    },
+    // {
+    //   name: 'manifold',
+    //   evalDataPath: path.join(__dirname, 'eval-manifold2.json'),
+    //   outputDir,
+    // },
+    // {
+    //   name: 'plane',
+    //   evalDataPath: path.join(__dirname, 'eval-plane.json'),
+    //   outputDir,
+    // },
+    // {
+    //   name: 'saleor',
+    //   evalDataPath: path.join(__dirname, 'eval-saleor.json'),
+    //   outputDir,
+    // },
   ]
 
   console.log(`Running ${evalConfigs.length} evaluations:`)
diff --git a/evals/git-evals/run-single-eval-process.ts b/evals/git-evals/run-single-eval-process.ts
@@ -74,7 +74,7 @@ async function main() {
       fingerprintId,
       codingAgent as any,
       agent,
-      false,
+      true,
     )
 
     // Check again after long-running operation
diff --git a/evals/git-evals/run-single-eval.ts b/evals/git-evals/run-single-eval.ts
@@ -199,7 +199,7 @@ async function runSingleEvalTask(options: {
       fingerprintId,
       codingAgent,
       agentType,
-      false,
+      true,
     )
 
     const duration = Date.now() - startTime
