Skip to content

Conversation

@Gkrumbach07
Copy link
Collaborator

fix ootb workflow cache
better google mcp restart
more reliable initial prompt

Implemented a new caching mechanism for out-of-the-box workflows to reduce GitHub API rate limits. The cache stores workflows for 5 minutes and checks for existing cached data before making API calls. Updated the ListOOTBWorkflows function to utilize this cache, improving performance and efficiency.
…dapter with persistent client support

The AGUI proxy now includes a retry mechanism for background requests to the runner, allowing for up to 15 attempts with exponential backoff if the runner is not ready. Additionally, the ClaudeCodeAdapter has been updated to support a persistent client for improved performance and to handle Google OAuth credentials more effectively, ensuring they are refreshed when available. This change enhances the overall reliability and efficiency of the system.
Updated the ClaudeCodeAdapter to create a fresh client for each run, simplifying the client management process. Removed the app_state dependency for client persistence, ensuring that each request operates independently. Adjusted related logging and cleanup procedures to reflect this change, enhancing reliability and clarity in client handling.
@codecov
Copy link

codecov bot commented Dec 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Claude Code Review

Summary

This PR implements three key improvements for the AG-UI integration:

  1. OOTB workflow caching - Adds 5-minute in-memory cache to avoid GitHub API rate limits
  2. Proxy retry logic - Handles runner startup delays with exponential backoff (15 retries)
  3. Improved initial prompt handling - Better Google MCP credential refresh and more reliable startup flow

Overall Assessment: Good improvements with solid implementation patterns. A few areas need attention before merge.


Issues by Severity

🚫 Blocker Issues

None - No critical blockers found.

🔴 Critical Issues

1. Race condition in OOTB cache implementation (sessions.go:1760-1768, 1806-1814)

The cache has a classic "check-then-act" race condition:

// Thread A checks cache (miss)
ootbCache.mu.RLock()
if ootbCache.cacheKey == cacheKey && ... {
    // ...
}
ootbCache.mu.RUnlock()

// Thread B could write here

// Thread A fetches from GitHub
entries, err := fetchGitHubDirectoryListing(...)

// Thread A writes cache
ootbCache.mu.Lock()
ootbCache.workflows = workflows
// ...
ootbCache.mu.Unlock()

Issue: Multiple concurrent requests during cache miss will all fetch from GitHub, defeating the rate limit protection.

Recommendation: Use single-flight pattern or mutex-protected "fetch-in-progress" flag to ensure only one goroutine fetches on cache miss.

2. Potential credential exposure in logs (adapter.py:443, 450)

While this PR doesn't introduce new token logging, the code proximity to credential handling warrants verification:

logger.info("Creating ClaudeSDKClient...")
# Ensure no credentials leak in SDK client initialization logs

Recommendation: Audit all log statements in _run_claude_agent_sdk to ensure no ANTHROPIC_API_KEY, BOT_TOKEN, or GitHub tokens are logged.

🟡 Major Issues

3. Unbounded goroutine accumulation (agui_proxy.go:145-203)

Each /agui/run request spawns a background goroutine with 2-hour timeout. If a session creates many runs (especially failed/retried runs), goroutines accumulate.

go func() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
    defer cancel()
    // ...
}()

Issue: No tracking or limit on concurrent background goroutines per session.

Recommendation:

  • Track active run goroutines per session (sync.Map or concurrent-safe counter)
  • Limit concurrent runs or implement graceful cancellation of old runs
  • Consider shorter timeout for failed runs

4. Missing authentication check in OOTB workflows handler (sessions.go:1739)

The ListOOTBWorkflows handler doesn't enforce user authentication when ?project parameter is omitted:

func ListOOTBWorkflows(c *gin.Context) {
    // No GetK8sClientsForRequest check at the top
    project := c.Query("project") // Optional
    if project != "" {
        // Only checks auth if project provided
        k8sClt, sessDyn := GetK8sClientsForRequest(c)
        // ...
    }
    // Continues without auth if no project

Issue: Public endpoint exposure. While OOTB workflows are public, this diverges from the standard pattern requiring authentication for all API operations.

Recommendation: Move auth check to top of handler or document why this endpoint is intentionally unauthenticated.

5. Client lifecycle change lacks testing documentation (adapter.py:485-500, 729-736)

Major architectural change from persistent client to fresh client per run:

# OLD: client_ctx = create_sdk_client(options)
#      client = await client_ctx.__aenter__()

# NEW: client = create_sdk_client(options)
#      await client.connect()

Issue: No documentation on testing this change or known impacts on conversation continuity.

Recommendation: Add comment explaining testing approach and any observed behavior changes (especially for conversation history persistence).

🔵 Minor Issues

6. Deprecated string capitalization function (sessions.go:1848)

workflowName = strings.Title(workflowName) // Deprecated since Go 1.18

Recommendation: Use cases.Title(language.English) from golang.org/x/text/cases.

7. Inconsistent error handling in Google credentials (adapter.py:1474-1513)

Function _setup_google_credentials has inconsistent return type - returns nothing on success but calls _try_copy_google_credentials which returns bool:

async def _setup_google_credentials(self):
    await self._try_copy_google_credentials()  # Return value ignored

Recommendation: Either propagate the return value or document why it's intentionally ignored during initial setup.

8. Magic number in retry logic (agui_proxy.go:156)

maxRetries := 15
retryDelay := 500 * time.Millisecond

Recommendation: Extract to constants with explanation:

const (
    maxRunnerStartupRetries = 15  // ~15s total with exponential backoff
    initialRetryDelay = 500 * time.Millisecond
)

9. Initial prompt delay hard-coded (main.py:129)

await asyncio.sleep(3)  # Hard-coded delay

Recommendation: Extract to environment variable or constant with comment explaining why 3 seconds.


Positive Highlights

Excellent concurrency patterns: Proper use of sync.RWMutex for read-heavy cache with separate read/write locks.

Intelligent error recovery: Stale cache fallback when GitHub API fails (lines 1806-1814) is great UX.

Smart retry logic: Exponential backoff with cap prevents thundering herd (agui_proxy.go:196-200).

Clear documentation: Code comments explain WHY, not just WHAT (e.g., "check for newly available Google OAuth credentials").

Security conscious: Proper token handling patterns maintained (e.g., not logging tokens).

Observability: Good structured logging with context (run IDs, attempt numbers, cache age).


Recommendations

Priority 1 (Before Merge)

  1. Fix cache race condition - Implement single-flight pattern for GitHub fetches
  2. Document OOTB auth decision - Explain why endpoint allows unauthenticated access OR add auth check
  3. Audit credential logging - Verify no tokens in logs during SDK client initialization

Priority 2 (Follow-up PR)

  1. Add goroutine tracking - Implement concurrent run limits or active goroutine monitoring
  2. Update string capitalization - Replace deprecated strings.Title
  3. Extract magic numbers - Convert hard-coded values to named constants

Priority 3 (Nice-to-have)

  1. Add tests for cache behavior - Unit tests for concurrent cache access, stale cache fallback
  2. Document client lifecycle change - Add testing notes for fresh client per run
  3. Consistent error handling - Document or fix return value inconsistency in Google credential setup

Code Quality Checklist

Authentication: Uses user token patterns correctly (where applicable)
⚠️ Authorization: OOTB endpoint potentially bypasses auth check
Error Handling: Good structured logging, appropriate HTTP status codes
Token Security: No new token logging introduced
Type Safety: Proper type assertions with checks
Concurrency: Good mutex usage (but has race condition)
Logging: Structured logs with relevant context
⚠️ Testing: Missing documentation on client lifecycle testing


Overall Recommendation

Conditional Approval - Address the cache race condition (#1) and document the OOTB auth decision (#4), then this is good to merge. The retry logic and credential refresh improvements are solid additions.

The changes follow established patterns and improve reliability. Main concerns are the concurrent cache access pattern and potential goroutine accumulation under load.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@Gkrumbach07 Gkrumbach07 merged commit efad481 into ambient-code:main Dec 18, 2025
15 of 16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant