Problem Statement
- No SDK-level support for conversation caching
  - Prompt caching can significantly reduce costs (up to 90% on cache reads) and latency
  - Users must implement custom hooks to enable caching across tool-use loops
- Limited scope of existing cache options
  - Current `cache_prompt` and `cache_tools` options only cover the system prompt and tool definitions
  - Minimal impact in real-world workflows where conversation history dominates token usage
- Provider-specific configuration
  - Cache configuration is only available in `BedrockModel`
  - Other providers have `cachePoint` conversion logic but no way to enable it via config
  - Inconsistent developer experience
Proposed Solution
Add a `cache_strategy` parameter to the base `Model` class, with hook-based auto-caching.
Key Components
- Model (ABC): Shared Configuration
  - Add `cache_strategy: Optional[str] = None` to the base `Model` class. All provider implementations (`BedrockModel`, `AnthropicModel`, `LiteLLM`, etc.) inherit this configuration automatically.
```python
# Usage (any provider)
agent = Agent(model=BedrockModel(cache_strategy="auto"))
agent = Agent(model=AnthropicModel(cache_strategy="auto"))
agent = Agent(model=LiteLLM(cache_strategy="auto"))
```
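A minimal sketch of where the shared field could live, assuming a TypedDict-style config like the existing provider configs; `BaseModelConfig` is a hypothetical name for illustration, not an existing SDK type:

```python
# Hypothetical sketch only: a shared config entry that every provider config
# could inherit, so the Agent can read it back uniformly via model.get_config().
from typing import Optional, TypedDict

class BaseModelConfig(TypedDict, total=False):  # hypothetical name
    cache_strategy: Optional[str]  # None (default, no auto-caching) or "auto"
```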
- Agent: Hook Auto-Registration
  - When `cache_strategy="auto"` is detected, the Agent automatically registers a `ConversationCachingHook`:
```python
# In Agent.__init__()
if model.get_config().get("cache_strategy") == "auto":
    self.hooks.add_hook(ConversationCachingHook())
```
- ConversationCachingHook: CachePoint Injection
  - The hook injects a `cachePoint` block into the last assistant message on each `BeforeModelCallEvent`:
```
# Before injection
[..., {role: "assistant", content: [...]}, {role: "user", ...}]
# After injection
[..., {role: "assistant", content: [..., {"cachePoint": {"type": "default"}}]}, {role: "user", ...}]
```
  - This single cache point covers the system prompt + tools + conversation history up to that point.
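As a rough illustration, a hook along these lines could perform the injection. This is a hedged sketch, assuming a `HookProvider`/`HookRegistry`-style registration API and that `BeforeModelCallEvent` exposes the agent's message list; the names follow the issue, not confirmed SDK signatures:

```python
from strands.hooks import HookProvider, HookRegistry, BeforeModelCallEvent  # assumed imports


class ConversationCachingHook(HookProvider):
    """Sketch: inject a cachePoint at the last assistant message before each model call."""

    def register_hooks(self, registry: HookRegistry, **kwargs) -> None:
        registry.add_callback(BeforeModelCallEvent, self._inject_cache_point)

    def _inject_cache_point(self, event: BeforeModelCallEvent) -> None:
        messages = event.agent.messages  # assumed: the event exposes the owning agent
        # Walk backwards to find the most recent assistant message.
        for message in reversed(messages):
            if message["role"] == "assistant":
                content = message["content"]
                # Skip if a cache point is already present (e.g. manually inserted).
                if not any("cachePoint" in block for block in content):
                    content.append({"cachePoint": {"type": "default"}})
                break
        # A full implementation would also remove stale cache points from earlier
        # messages so the provider's cache-point limit is not exceeded.
```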
- Provider-Specific Handling (Existing Logic)
  - Each model provider processes the injected `cachePoint` using existing conversion logic:

| Provider | Handling | Status |
| --- | --- | --- |
| BedrockModel | Pass-through (native format) | Maintain |
| AnthropicModel | Convert to `cache_control` | Maintain |
| LiteLLMModel | Convert to `cache_control` | Fix needed (add message-level handling) |
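For context, the conversion these providers perform is roughly of this shape: a Bedrock-style `cachePoint` block becomes an Anthropic-style `cache_control` marker on the preceding content block. The function below is an illustrative sketch of the idea, not the SDK's actual code:

```python
def convert_cache_points(content_blocks: list[dict]) -> list[dict]:
    """Illustrative only: turn Bedrock-style cachePoint blocks into an
    Anthropic-style cache_control marker on the preceding content block."""
    converted: list[dict] = []
    for block in content_blocks:
        if "cachePoint" in block:
            if converted:
                # Anthropic caches everything up to and including the marked block.
                converted[-1]["cache_control"] = {"type": "ephemeral"}
        else:
            converted.append(dict(block))
    return converted
```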
Use Case
Agent workflows with tool usage where multiple model calls occur:
Single-turn scenarios:
- Tool-heavy tasks (search → fetch → analyze → respond)
- Each model call after first assistant message benefits from cache
- Test result: 50-90+% cache hit within single turn (link)
Multi-turn scenarios:
- Conversation history accumulates across turns
- Previous turns fully cached on subsequent turns
- Compounding cost savings over conversation lifetime
Impact:
- 90% cost reduction on cached tokens
- Reduced latency on subsequent model calls
Alternative Solutions
- Manual cachePoint injection - current approach; requires a custom hook implementation
- Agent-level cache_strategy - rejected because the Model owns provider configuration
- Automatic system prompt caching only - insufficient, since conversation history dominates token usage
Additional Context
- Backward compatible with existing manual cachePoint insertion
- Extensible for future cache strategies