Agent-as-tool + HITL resume: weakref cache loss causes infinite pause loop #2487

@Pronto-Sage

Description

When using agent.as_tool() with needs_approval=True on a tool inside the nested agent, resuming after human-in-the-loop (HITL) approval starts a fresh inner run instead of continuing the paused one. The fresh run hits the same approval gate, so the result is an infinite pause/approve/resume cycle.


Root Cause

agent_tool_state.py stores nested run results in a module-level dict keyed by tool_call object identity, with weakref-based cleanup that removes an entry once its tool_call object is collected.

When the caller:

  1. Serializes RunState via to_json()
  2. Releases the RunResult

The tool_call objects are garbage collected → weakref callbacks fire → cache entries are removed.

On resume:

  • peek_agent_tool_run_result() returns None
  • The SDK starts a fresh inner run
  • It hits needs_approval again
  • It pauses
  • The cycle repeats
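
The failure mode can be illustrated without the SDK. The following is a minimal sketch, assuming a cache shaped like the one described above; `ToolCall`, `record`, and `peek` are hypothetical stand-ins, not the actual contents of agent_tool_state.py.

```python
import gc
import weakref


class ToolCall:
    """Hypothetical stand-in for an SDK tool_call object."""

    def __init__(self, call_id):
        self.call_id = call_id


# Module-level cache keyed by object identity: id(tool_call) -> (weakref, result)
_cache = {}


def record(tool_call, nested_result):
    """Store a nested run result; drop it when tool_call is collected."""
    key = id(tool_call)

    def _drop(_ref, key=key):
        _cache.pop(key, None)

    _cache[key] = (weakref.ref(tool_call, _drop), nested_result)


def peek(tool_call):
    """Return the cached nested result for this exact object, or None."""
    entry = _cache.get(id(tool_call))
    return entry[1] if entry else None


# Pause: the inner run is cached against the live tool_call object.
call = ToolCall("call_1")
record(call, "paused inner run")
found_before = peek(call)

# The caller serializes the state and drops the last strong reference.
del call
gc.collect()
entries_after_gc = len(_cache)

# Resume: a new object stands in for the deserialized tool call.
restored = ToolCall("call_1")
found_after = peek(restored)
```

In this toy model the lookup after resume finds nothing, which corresponds to the point where the SDK would start a fresh inner run and pause again.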

Repro

  1. Create an outer agent with an inner agent via as_tool()
  2. The inner agent has a tool with needs_approval=True
  3. Run → execution pauses for approval
  4. Serialize result.to_state().to_json() and let result go out of scope
  5. Deserialize with RunState.from_json()
  6. Call Runner.run() with the resumed state + approved decisions
  7. The inner agent starts fresh instead of resuming → pauses again on the same tool

Expected

RunState.to_json() should capture nested agent-as-tool run results so they survive serialization round-trips.


Workaround

Keep a strong reference to the RunResult object between pause and resume to prevent garbage collection of tool_call objects.

This only works within a single process lifetime.
