Skip to content

Conversation

@Gkrumbach07
Copy link
Collaborator

No description provided.

@Gkrumbach07 Gkrumbach07 added this to the v0.0.14 milestone Dec 17, 2025
@Gkrumbach07 Gkrumbach07 changed the title UI cleanup Add AG-UI protocol for message streaming Dec 17, 2025
@github-actions

This comment has been minimized.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 17, 2025

Claude Code Review

Summary

This PR implements a major architectural refactoring transitioning from WebSocket to AG-UI protocol (HTTP/SSE). 55 files changed: +6,732/-4,224 lines.

Overall Assessment: Well-structured implementation with several CRITICAL concurrency and error handling issues that must be fixed before merge.

Critical Issues (Must Fix)

  1. Race Condition - agui.go:119-127 uses RLock for check-then-act pattern (needs Lock)
  2. Hard-coded Sleep - time.Sleep(100ms) masks synchronization issues
  3. Silent Event Drops - Channel overflows drop events without backpressure
  4. Missing Panic Recovery - Goroutines will crash backend on panic
  5. Frontend Timeout Leak - Reconnection timeouts not cleared on unmount
  6. Type Safety Violations - Uses 'as unknown as' bypassing TypeScript checks

Major Issues (Should Fix)

  1. RBAC Verification - Verify new AG-UI endpoints have proper middleware
  2. Incomplete Compaction - Discards incomplete tools instead of marking interrupted
  3. Build Metadata - No validation if ldflags injection fails
  4. No Test Coverage - Zero tests for new AG-UI code

Positive Highlights

✅ Well-structured AG-UI protocol types
✅ Sound event streaming architecture
✅ Proper channel-based pub/sub
✅ Comprehensive TypeScript type guards
✅ Excellent build metadata
✅ Smart compact-on-read strategy

Recommendation: Request changes for CRITICAL issues (1-6), then re-review.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 17, 2025

Claude Code Review

Summary

This PR introduces a major architectural change by implementing the AG-UI protocol for message streaming between the frontend, backend, and runner components. The refactor moves from WebSocket-based communication to HTTP/SSE (Server-Sent Events) with a FastAPI server in the runner. While the protocol implementation appears solid, there are several critical security and architectural concerns that must be addressed before merge.

Overall Assessment: 🟡 Major revision required - The protocol implementation is good, but security issues and architectural concerns need resolution.


Issues by Severity

🚫 Blocker Issues

1. Missing User Token Authentication in AG-UI Proxy Endpoints

  • Location: components/backend/websocket/agui_proxy.go (all proxy handlers)
  • Issue: New AG-UI endpoints (/agui/run, /agui/events, /agui/interrupt, etc.) DO NOT use user-scoped Kubernetes clients
  • Security Impact: RBAC bypass - users can interact with sessions they don't have permission to access
  • Expected Pattern: All handlers MUST call handlers.GetK8sClientsForRequest(c) and validate user authorization
  • Example violation:
// agui_proxy.go:26 - HandleAGUIRunProxy
func HandleAGUIRunProxy(c *gin.Context) {
    projectName := c.Param("projectName")
    sessionName := c.Param("sessionName")
    // ❌ MISSING: User token authentication check
    // ✅ REQUIRED:
    // reqK8s, _ := handlers.GetK8sClientsForRequest(c)
    // if reqK8s == nil {
    //     c.JSON(http.StatusUnauthorized, gin.H{"error": "Unauthorized"})
    //     return
    // }
  • Files affected: All handlers in agui_proxy.go (HandleAGUIRunProxy, HandleAGUIEvents, HandleAGUIInterrupt, HandleAGUIHistory, HandleAGUIRuns)
  • Reference: CLAUDE.md lines 344-347, .claude/patterns/k8s-client-usage.md

2. No RBAC Permission Checks Before Proxying

  • Location: components/backend/websocket/agui_proxy.go
  • Issue: Handlers proxy requests to runner without verifying user has access to the namespace/session
  • Required Fix: Add SelfSubjectAccessReview checks before proxying:
ssar := &authv1.SelfSubjectAccessReview{
    Spec: authv1.SelfSubjectAccessReviewSpec{
        ResourceAttributes: &authv1.ResourceAttributes{
            Group:     "vteam.ambient-code",
            Resource:  "agenticsessions",
            Verb:      "update", // or "get" for read-only endpoints
            Namespace: projectName,
        },
    },
}
res, err := reqK8s.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, ssar, v1.CreateOptions{})
if err \!= nil || \!res.Status.Allowed {
    c.JSON(http.StatusForbidden, gin.H{"error": "Unauthorized"})
    return
}

3. Unsafe Type Assertions Without Checking

  • Location: components/backend/websocket/agui.go:159, 182, 196, 218 (multiple instances)
  • Pattern Violation: Direct type assertions that can panic
  • Examples:
// Line 159
eventType, _ := event["type"].(string)  // ❌ Silent failure if type is wrong

// Line 182
eventRunID, _ := event["runId"].(string)  // ❌ No validation
  • Required Fix: Use unstructured.Nested* helpers or explicit checks:
eventType, ok := event["type"].(string)
if \!ok {
    log.Printf("Invalid event type, skipping")
    return
}
  • Reference: CLAUDE.md lines 361-365

🔴 Critical Issues

4. Potential Goroutine Leaks in Background Stream Consumer

  • Location: components/backend/websocket/agui_proxy.go:112-180
  • Issue: Background goroutine that consumes runner stream may leak if request context is cancelled
  • Risk: Over time, leaked goroutines accumulate and consume memory/CPU
  • Current code:
go func() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
    defer cancel()
    // ... long-running HTTP stream consumption
}()
  • Problem: Uses context.Background() instead of request context - won't cancel when client disconnects
  • Recommendation: Track goroutines and ensure cleanup:
// Use request context (with extended timeout)
ctx, cancel := context.WithTimeout(c.Request.Context(), 2*time.Hour)
defer cancel()

// OR: Add to sync.WaitGroup for graceful shutdown tracking

5. Missing Error Context in Logs

  • Location: Throughout agui.go and agui_proxy.go
  • Issue: Many log statements lack critical context (session name, project, run ID)
  • Impact: Debugging production issues will be difficult
  • Examples:
// agui.go:78
log.Printf("AGUI: subscriber channel full, skipping event")
// ❌ Missing: which session? which run? which event type?

// ✅ Better:
log.Printf("AGUI: subscriber channel full for session %s run %s, skipping event %s", 
    sessionID, runID, eventType)

6. No Rate Limiting on /agui/run Endpoint

  • Location: agui_proxy.go:24 (HandleAGUIRunProxy)
  • Risk: Users can spam run creation, exhausting backend/runner resources
  • Recommendation: Add per-user rate limiting (e.g., max 10 runs/minute)

7. Long Timeout Without Justification

  • Location: agui_proxy.go:114 - context.WithTimeout(context.Background(), 2*time.Hour)
  • Issue: 2-hour timeout seems arbitrary and very long
  • Question: What is the expected max duration of a Claude Code session? Should this be configurable?

🟡 Major Issues

8. Frontend Hook Uses any Type

  • Location: components/frontend/src/hooks/use-agui-stream.ts
  • Violation: Frontend standard - Zero any Types (CLAUDE.md line 1035)
  • Issue: While types are generally good, check for any lingering any usage
  • Action: Run ESLint and ensure @typescript-eslint/no-explicit-any passes

9. Inconsistent Error Handling Patterns

  • Location: agui.go - some functions return errors, others log silently
  • Example:
    • persistAGUIEventMap (line ~250+) - logs but doesn't return error
    • handleStreamedEvent (line 195+) - logs parse failures but continues
  • Impact: Silent failures make debugging harder
  • Recommendation: Establish consistent pattern - either propagate errors or document why logging is sufficient

10. Missing Tests for New AG-UI Endpoints

  • Location: No test files for agui.go or agui_proxy.go
  • Required: Add contract tests validating:
    • Unauthorized access returns 401/403
    • Valid requests proxy correctly
    • Event streaming works end-to-end
    • Error cases handled gracefully

11. Removed WebSocket Code Without Deprecation Warning

  • Location: Deleted websocket/handlers.go, websocket/hub.go
  • Risk: Breaking change if any external clients depend on WebSocket protocol
  • Recommendation: Add migration guide or deprecation notice if this affects users

12. Event Compaction Logic Complexity

  • Location: components/backend/websocket/compaction.go (new file, 451 lines)
  • Concern: Complex event compaction logic without clear documentation on edge cases
  • Recommendation: Add comprehensive comments explaining:
    • What events are compacted and why
    • How partial tool call states are handled
    • Edge cases (e.g., out-of-order events, duplicate IDs)

🔵 Minor Issues

13. Magic Numbers in Code

  • Location: agui.go:54 - make(chan *types.BaseEvent, 100)
  • Issue: Channel buffer size of 100 is hardcoded
  • Recommendation: Extract to constant with explanation

14. Inconsistent Naming Conventions

  • Location: agui.go - aguiRuns, aguiRunsMu, aguiEventChannels
  • Issue: Mixing camelCase and package prefixes
  • Recommendation: Consider runStates, runStatesMu for consistency with Go naming

15. Build Metadata Implementation

  • Location: Makefile, Dockerfiles - new build-arg metadata
  • Positive: Good addition for traceability
  • Minor Issue: GIT_DIRTY uses || echo "-dirty" which might always append "-dirty"
  • Fix: git diff --quiet && echo "" || echo "-dirty"

16. Frontend: Missing Loading States on Buttons

  • Location: Check components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx
  • Requirement: All buttons should have loading states (CLAUDE.md line 1050)
  • Action: Verify buttons use disabled={isLoading} pattern

17. Removed gorilla/websocket Dependency

  • Location: components/backend/go.mod
  • Issue: Dependency removed but should verify no other code depends on it
  • Action: Run go mod tidy && go build ./... to ensure clean build

Positive Highlights

Well-Structured Protocol Implementation: AG-UI types and events follow the spec closely
Clear Separation of Concerns: Backend proxies, runner implements, frontend consumes
Good Type Safety in Frontend: AG-UI types are well-defined in TypeScript
Event Persistence: Events are stored for replay/history (good for debugging)
Build Metadata: Excellent addition for production traceability
Documentation: Good inline comments explaining AG-UI protocol decisions
Error Recovery: Event stream reconnection logic in frontend hook
Performance: SSE is more efficient than WebSocket for one-way streaming


Recommendations

Immediate Actions (Before Merge)

  1. 🚫 FIX SECURITY ISSUES (Blockers Outcome: Reduce Refinement Time with agent System #1, Epic: RAT Architecture & Design #2):

    • Add GetK8sClientsForRequest to ALL AG-UI handlers
    • Add RBAC checks before proxying to runner
    • Add tests validating unauthorized access returns 403
  2. 🔴 FIX TYPE SAFETY (Blocker Epic: Data Source Integration #3):

    • Replace all unchecked type assertions with safe alternatives
    • Use val, ok := map[key].(type) pattern
  3. 🔴 FIX GOROUTINE LEAKS (Critical Epic: AI Agent Development #4):

    • Use request context in background stream consumer
    • Add shutdown tracking for graceful cleanup
  4. 🟡 ADD TESTS:

    • Backend: AG-UI endpoint contract tests
    • Frontend: Hook behavior tests
    • Integration: End-to-end event flow

Post-Merge Improvements

  • Add rate limiting on /agui/run
  • Document event compaction algorithm
  • Add monitoring/metrics for event stream health
  • Consider making timeout configurable via ProjectSettings

Architecture Questions

  1. Why 2-hour timeout? Is this based on expected session duration?
  2. Event retention policy? How long are events stored? Cleanup strategy?
  3. Multi-tenancy: Does event storage isolate sessions by namespace?
  4. Backward compatibility: Migration path for WebSocket clients?

Pre-Merge Checklist

  • Security: User token auth added to all AG-UI handlers
  • Security: RBAC checks added before proxying
  • Code Quality: All type assertions are checked
  • Code Quality: Goroutine lifecycle managed properly
  • Testing: Contract tests for new endpoints added
  • Testing: Frontend hook tests added
  • Linting: make lint passes for backend and frontend
  • Build: make build-all succeeds
  • Documentation: Update API docs with new endpoints

Verdict

Status: ⚠️ DO NOT MERGE until blocker issues (#1, #2, #3) are resolved.

The AG-UI protocol implementation is architecturally sound and well-structured, but the missing authentication and authorization checks are critical security vulnerabilities that violate established patterns in CLAUDE.md. Once security issues are addressed and tests are added, this will be a solid improvement to the platform.

Estimated effort to fix blockers: 4-6 hours
Estimated effort for full polish: 1-2 days

Happy to review again after fixes! 🚀


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Claude Code Review

Summary

This PR introduces the AG-UI protocol for message streaming, replacing the previous WebSocket-based communication with a modern HTTP/SSE approach. The changes are extensive (7,286 additions, 4,369 deletions across 58 files). While the architectural direction is sound, there are critical security, error handling, and code quality issues that must be addressed before merging.

Overall Assessment: Requires significant revisions - Multiple blocker and critical issues identified.

Issues by Severity

Blocker Issues

1. Global mutable state without proper synchronization (components/backend/websocket/agui.go:23-36)

Problem: Package-level global state with nested maps. The nested map structure is not fully protected by RWMutex when inner maps are modified.
Risk: Race conditions, data corruption, potential panics in production.

2. Missing RBAC validation in AG-UI proxy endpoints (components/backend/routes.go:83-90)

Problem: New AG-UI endpoints lack ValidateProjectContext() middleware. Users can access endpoints without RBAC checks.
Risk: CRITICAL SECURITY VULNERABILITY - unauthorized access to session data.

3. Potential goroutine leaks in subscriber management (components/backend/websocket/agui.go:53-66)

Problem: No cleanup mechanism when runs finish. Channels closed without checking if already closed.
Risk: Memory leaks, panics, resource exhaustion.

Critical Issues

4. Missing user-scoped client usage in proxy handlers (components/backend/websocket/agui_proxy.go:24-100)

Problem: Proxy handlers use package-level K8s clients without user token validation. Violates authentication pattern.
Reference: CLAUDE.md lines 344-348, .claude/patterns/k8s-client-usage.md

5. Error handling doesn't follow established patterns (components/backend/websocket/agui_proxy.go)

Problem: Generic error messages, no differentiation between error types, missing structured logging.
Reference: .claude/patterns/error-handling.md lines 9-40

6. Type safety violations in unstructured access

Need to verify: All obj.Object["field"].(Type) patterns replaced with unstructured.Nested* helpers.

Major Issues

7. Frontend: Multiple 'any' type violations (components/frontend/src/hooks/use-agui-stream.ts:66)

Problem: Violates Zero 'any' Types rule. pendingToolCalls and pendingChildren lack proper types.

8. Operator: Missing OwnerReferences validation (components/operator/internal/handlers/sessions.go)

Need to verify created resources have proper OwnerReferences set for cleanup.

9. Build metadata creates unnecessary cache invalidation (Makefile, Dockerfiles)

Problem: Every commit triggers full rebuild due to LABEL changes, defeating Docker layer caching.

10. Missing tests for new AG-UI functionality

No test files added for agui.go, agui_proxy.go, compaction.go, use-agui-stream.ts

Minor Issues

11. Inconsistent image naming - Changed from hyphens to underscores without migration path
12. Removed gorilla/websocket dependency - Verify no other code depends on it
13. Logging verbosity - High-frequency logging in hot path may impact performance
14. Frontend state handling - Verify all loading/error states handled properly

Positive Highlights

  • Architecture improvement: HTTP/SSE better than WebSocket
  • Type definitions: Comprehensive AG-UI protocol types
  • Build metadata: Embedding git info via ldflags is best practice
  • Code removal: 4,369 lines cleaned up
  • Documentation: Added docs/build-metadata.md
  • Mutex usage: Correct RWMutex for read-heavy workloads

Recommendations

Priority 1 (Must Fix Before Merge):

  1. Add RBAC validation to all AG-UI endpoints
  2. Implement user-scoped client usage in proxy handlers
  3. Fix global state synchronization
  4. Add channel cleanup and goroutine lifecycle management
  5. Fix frontend type safety

Priority 2 (Should Fix Before Merge):
6. Add comprehensive tests
7. Fix error handling patterns
8. Verify OwnerReferences
9. Document breaking changes
10. Add race detection to CI

Priority 3 (Can Address Post-Merge):
11. Optimize Docker caching strategy
12. Reduce logging verbosity
13. Add AG-UI metrics
14. Consider connection pooling

Files Requiring Immediate Attention

  1. components/backend/websocket/agui.go - Fix global state and goroutine leaks
  2. components/backend/websocket/agui_proxy.go - Add RBAC, user token validation
  3. components/backend/routes.go - Add middleware to AG-UI routes
  4. components/frontend/src/hooks/use-agui-stream.ts - Fix type safety
  5. components/operator/internal/handlers/sessions.go - Verify OwnerReferences

Verdict: Significant architectural improvement, but critical security and stability issues must be resolved before merging.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Claude Code Review

Summary

This PR introduces the AG-UI Protocol for event streaming, replacing the legacy WebSocket-based messaging system with HTTP/SSE. This is a major architectural change affecting all components (backend, frontend, operator, runner). The changes are extensive (7,459 additions, 4,369 deletions across 59 files).

Issues by Severity

Blocker Issues

None identified - No critical security vulnerabilities or architectural violations found.

Critical Issues

  1. Potential Goroutine Leak (components/backend/websocket/agui_proxy.go:145) - Uses context.Background() instead of request context. Goroutines persist even if client disconnects. Add cleanup mechanism when session is deleted.

  2. Token Logging Risk (components/backend/websocket/agui_proxy.go) - Audit all log statements to ensure no tokens are logged in error paths.

  3. Unsafe Type Assertions (components/backend/websocket/compaction.go:173-189) - Direct type assertions without checking ok value. Should use explicit error checking per established patterns.

  4. EventSource Memory Leak (components/frontend/src/hooks/use-agui-stream.ts:89-90) - EventSource connections may not be properly cleaned up on fast navigation.

Major Issues

  1. Missing RBAC Checks - Verify all new AG-UI endpoints have user authentication and RBAC validation like HandleAGUIRunProxy does.

  2. No Test Coverage - No tests added for new AG-UI endpoints. Need unit, integration, and contract tests.

  3. Large Component File - Session page grew by 442 lines (should be under 200 per DESIGN_GUIDELINES.md). Extract to separate components.

  4. Missing Documentation - No AG-UI protocol migration guide or ADR explaining the architectural change.

Minor Issues

  1. Image Tag Naming - Changed from vteam-frontend to vteam_frontend. Verify CI/CD and Kustomize manifests updated.

  2. Large Python File - adapter.py is 1,458 lines. Should be split into events.py, streaming.py, compaction.py modules.

  3. No Deprecation Notice - Removed files (runner-shell/, message-adapter.ts) without migration guide.

Positive Highlights

  • Excellent security practices (RBAC, token handling where implemented)
  • Well-structured AG-UI types following protocol spec
  • Comprehensive build metadata system with ldflags
  • Strong TypeScript typing in frontend
  • Proper goroutine safety with mutexes and buffered channels
  • Smart compact-on-read event persistence strategy

Recommendations

Before Merge:

  1. Add RBAC checks to all new AG-UI endpoints
  2. Audit token logging in proxy error paths
  3. Add cleanup for background goroutines
  4. Verify build metadata logged on startup
  5. Add basic tests for new endpoints

After Merge:
6. Add AG-UI documentation and ADR
7. Refactor large files (adapter.py, session page)
8. Add E2E tests for full AG-UI flow
9. Performance test goroutine cleanup
10. Update CI/CD for new image names

Architecture Validation

Follows Project Standards:

  • Backend uses Gin framework consistently
  • Operator maintains reconciliation patterns
  • Proper separation: types, handlers, services

Security Standards Met:

  • User token authentication (where checked)
  • RBAC enforcement (where implemented)
  • No sensitive data in logs (where audited)

Needs Verification:

  • All new endpoints have RBAC checks
  • No token values in any logs
  • EventSource cleanup prevents leaks
  • Background goroutines cleaned up

Overall Assessment

This is a well-executed major refactoring that modernizes the messaging architecture. The AG-UI protocol implementation is solid, with good type safety and security practices. Main concerns are goroutine lifecycle management, missing test coverage, and documentation gaps. Code quality is high.

Recommendation: Approve with required changes (address critical RBAC and goroutine cleanup issues before merge).


Review by Claude Code (claude-sonnet-4-5)
Context: CLAUDE.md, backend-development.md, frontend-development.md, security-standards.md, k8s-client-usage.md, error-handling.md, react-query-usage.md


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

- Removed the autoTriggerInitialPrompt function from the backend, delegating initial prompt execution to the runner on startup.
- Updated the StartSession handler to reflect changes in initial prompt handling, ensuring events flow through the backend for both UI and headless/API usage.
- Adjusted frontend logic to remove direct initial prompt sending, relying on backend auto-execution instead.
- Refactored session reconciliation logic to directly call runner endpoints for adding and removing repositories and changing workflows, enhancing integration and simplifying the codebase.

These changes improve the clarity and efficiency of session management, ensuring a more robust handling of initial prompts and repository workflows.

refactor: Enhance ToolCall structure and event handling

- Added ParentToolUseID field to the ToolCall struct for hierarchical tool call support.
- Refactored event handling in getCompactedMessages and compactEventsToState to utilize a map for tracking tool call states, improving management of parallel tool calls.
- Updated logic to ensure all relevant fields are captured during tool call events, enhancing event completeness and traceability.

These changes improve the clarity and functionality of tool call processing within the AGUI framework.

refactor: Streamline AG-UI event handling and improve logging

- Refactored the sendInitialSyncEvents function to load and compact events at read-time, enhancing performance and compliance with AG-UI specifications.
- Updated logging to provide clearer insights into event loading and message snapshot sending, improving traceability during event processing.
- Removed the getCompactedMessages function, consolidating message handling logic to simplify the codebase and improve maintainability.

These changes enhance the efficiency and clarity of AG-UI event handling, ensuring better performance and easier debugging.

refactor: Implement hierarchical tool message rendering

- Enhanced the ProjectSessionDetailPage to support hierarchical tool message rendering, allowing for nested tool calls to be displayed under their parent tools.
- Updated the StreamMessage and ToolMessage components to handle and visualize hierarchical tool messages, including child tool calls.
- Introduced new types for HierarchicalToolMessage and updated existing types to accommodate parent-child relationships in tool calls.

These changes improve the clarity and organization of tool call displays, enhancing the user experience in managing complex tool interactions.

refactor: Improve AGUI event handling and logging

- Refactored the handling of TOOL_CALL_START events to ensure all relevant fields are returned, enhancing event completeness.
- Added debug logging for tool call start and end events, providing better visibility into the tool call lifecycle and pending tool calls.
- Cleared pending tool calls after flushing to maintain state integrity.

These changes enhance the clarity and traceability of AGUI event processing.

refactor: Remove nested styling marker from ProjectSessionDetailPage

- Cleaned up the ProjectSessionDetailPage by removing the nested styling marker, as it is no longer necessary.
- Updated the useAGUIStream hook to ensure tool calls are added only if they exist, improving the handling of tool messages.

This refactor enhances code clarity and maintains functionality without the redundant nested marker.

feat: Implement hierarchical tool call rendering for sub-agents

- Runner: Emit parentToolUseId in TOOL_CALL_START events
- Backend: Add ParentToolUseID field to ToolCallStartEvent
- Frontend: Build tool hierarchy by nesting children under parent tools
- Frontend: Render nested tools under parent tool cards
- Frontend: Support sub-agent visualization with proper parent-child relationships

This enables proper visualization of sub-agent delegations where child tool calls
(WebSearch, etc.) are rendered inside the parent Task tool card.

fix: Adjust TEXT_MESSAGE_END emission logic in ClaudeCodeAdapter

- Updated the logic to delay emitting TEXT_MESSAGE_END until after processing ResultMessage.
- This change ensures that the stop_reason can be checked for 'tool_use', allowing for proper handling of subsequent blocks.

This fix improves the message handling flow and prevents premature termination of message sequences.

feat: Implement build metadata system for enhanced traceability

- Added build metadata capture in Makefile, including Git commit, branch, repository, version, build date, and user.
- Updated Dockerfiles across components to accept build arguments for metadata and set them as environment variables.
- Enhanced backend and operator to log build information on startup, embedding metadata directly into binaries.
- Introduced frontend instrumentation to log build details during server startup.
- Created documentation detailing the build metadata system and its usage.

This system improves traceability of deployed images and simplifies debugging by providing clear versioning information at runtime.

fix: AG-UI event streaming and tool call display fixes

- Backend: Return full event structs with all fields (toolCallId, toolCallName, delta, result)
- Backend: Preserve user message IDs from frontend to prevent duplicates
- Backend: Parse AG-UI events in getCompactedMessages() for message persistence on refresh
- Backend: Handle TEXT_MESSAGE_*, TOOL_CALL_*, and RAW events in legacy storage
- Frontend: Display tool calls immediately on TOOL_CALL_START (don't wait for END)
- Frontend: Show streaming tool args as they accumulate
- Frontend: Include toolCalls array in completed tool messages with args/result
- Frontend: Graceful error logging (suppress ResponseAborted, ECONNREFUSED during restarts)
- Frontend: Safer tool arg parsing with validation
- Runner: Emit TEXT_MESSAGE_END after processing all blocks (keeps tools with message)

Fixes: tool calls not showing, messages disappearing on refresh, duplicate user messages

feat: Enhance AG-UI Event Handling with Full Event Broadcasting

This commit introduces significant improvements to the AG-UI event handling system, enabling the broadcasting of full events with all fields, enhancing the messaging capabilities within sessions.

- **AGUIRunState**: Added `fullEventSub` map to manage full event subscribers.
- **BroadcastFull Method**: Implemented to send full events to subscribers, alongside legacy BaseEvent notifications.
- **Event Extraction**: Enhanced `extractBaseEvent` function to support various AG-UI event types.
- **Session Handling**: Updated `HandleAGUIRun` and `HandleAGUIEvents` to accommodate full event subscriptions and broadcasting.
- **Frontend Integration**: Adjusted hooks and components to utilize the new event streaming capabilities.

- Improved real-time event handling for enhanced user experience.
- Simplified event management by consolidating full event broadcasting logic.
- Increased flexibility in handling different event types within the AG-UI framework.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

feat: Integrate AG-UI Protocol for enhanced session messaging

This commit introduces the AG-UI Protocol to improve the messaging system within sessions, allowing for real-time streaming and better event handling.

- **Backend**: Added AG-UI endpoints for session management in `routes.go`.
- **WebSocket**: Implemented broadcasting of AG-UI events in `hub.go`.
- **Frontend**: Replaced legacy message handling with AG-UI streaming in `page.tsx`, including hooks for managing AG-UI streams.
- **Message Adapter**: Removed the old message adapter and integrated new message handling logic directly in the session detail page.
- **UI Components**: Updated message components to support streaming indicators and new message formats.

- Enhanced user experience with real-time message updates.
- Simplified message handling logic by removing the old adapter.
- Improved maintainability and scalability of the messaging system.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

refactor: Transition to AG-UI Protocol for session messaging

- Removed legacy message handling endpoints in `routes.go`, consolidating to AG-UI protocol.
- Updated WebSocket handlers to route AG-UI events directly, simplifying message processing.
- Eliminated deprecated functions for message retrieval and control messaging, enhancing clarity and maintainability.
- Integrated AG-UI event streaming in frontend components, replacing previous message handling logic.
- Improved logging for AG-UI event processing, providing better insights during runtime.

These changes streamline the messaging system, ensuring real-time updates and a more efficient architecture.

refactor: Streamline AGUI event handling and improve compaction strategy

- Updated `RouteAGUIEvent` to prevent lazy run creation for terminal events, enhancing event processing logic.
- Removed legacy functions `loadCompactedMessages` and `compactAndPersistRun`, transitioning to a "compact-on-read" strategy to eliminate race conditions and simplify event management.
- Enhanced logging for AGUI events to provide better insights during runtime.
- Updated WebSocket message handling to ensure proper event type assignment and routing.

These changes improve the efficiency and clarity of AGUI event handling, ensuring better performance and easier debugging.

refactor: Enhance AGUI event handling and compaction logic

- Updated the `streamThreadEvents` function to only compact events from completed runs, improving event processing and preventing duplicates from active runs.
- Introduced logic to filter out events from active runs and log relevant information for better traceability.
- Modified the `MessageCompactor` to ensure in-progress tool calls are not included in snapshots, enhancing the clarity of event messages.
- Enhanced the frontend to support rendering of pending tool calls and children, improving the user experience during streaming.
- Updated the `useAGUIStream` hook to manage pending tool calls and children, ensuring accurate state representation in the UI.

These changes streamline AGUI event handling, improve performance, and enhance the overall user experience during tool call streaming.

feat: Add python-struct dependency and enhance JSON parsing for Python literals

- Introduced the `python-struct` package to handle structured data parsing.
- Implemented a helper function `pythonLiteralToJson` to convert Python literal strings to JSON-parseable format, improving compatibility with legacy data.
- Enhanced the `extractTextFromResultContent` function to support parsing both JSON and Python notations, ensuring robust handling of various content formats.
- Updated the `ClaudeCodeAdapter` to use `json.dumps()` for proper JSON serialization of results, enhancing data integrity in tool calls.

These changes improve data handling and parsing capabilities, facilitating better integration with Python-based content.

refactor: Enhance AGUI event logging and compaction logic

- Improved logging for AGUI events, including detailed messages for user events and active run processing.
- Updated the `persistAGUIEventMap` function to log successful persistence of user messages.
- Enhanced the `streamThreadEvents` function to accurately determine active runs by checking for terminal events, improving event handling.
- Modified the `MessageCompactor` to flush user messages immediately upon completion, ensuring timely processing.
- Added checks for actual content in tool messages to improve user experience and clarity in the frontend.

These changes enhance the clarity and efficiency of AGUI event handling, providing better insights during runtime and improving overall user experience.

feat: Implement AG-UI Protocol for enhanced session management and event streaming

- Introduced a new AG-UI server for Claude Code, replacing legacy WebSocket communication with HTTP/SSE for event streaming.
- Updated backend routes to support AG-UI protocol, including new endpoints for session management and event handling.
- Implemented a proxy mechanism in the backend to forward AG-UI run requests to the FastAPI server, enhancing real-time communication.
- Enhanced frontend components to utilize the new AG-UI event streaming capabilities, ensuring seamless integration with the updated backend.
- Improved logging and error handling for AG-UI events, providing better insights during runtime.

These changes significantly enhance the messaging system, enabling real-time updates and a more efficient architecture for session management.

refactor: Remove legacy WebSocket handling and transition to AG-UI protocol

- Eliminated WebSocket-related code, including handlers and connection management, as the AG-UI server now utilizes HTTP/SSE for event streaming.
- Updated backend to remove references to WebSocket URLs and legacy message handling, streamlining session communication.
- Adjusted frontend components to load initial message history via GET /agui/events and connect to AG-UI streams for live updates.
- Enhanced logging and error handling to align with the new AG-UI protocol, improving runtime insights.

These changes significantly modernize the messaging architecture, ensuring a more efficient and maintainable system.

refactor: Update AGUI proxy handling and improve event streaming

- Refactored the AGUI proxy to initiate background processing for HTTP requests, ensuring the connection remains open for SSE streams.
- Enhanced error handling and logging for background requests, improving traceability of failures.
- Updated frontend components to reflect changes in AGUI run endpoint behavior, returning metadata immediately instead of streaming responses.
- Improved handling of tool call events in the MessageCompactor to support both camelCase and snake_case formats, enhancing compatibility with different data sources.

These changes streamline the AGUI event handling process, ensuring better performance and user experience during session management.

feat: Add interrupt functionality for AG-UI sessions

- Introduced a new endpoint to handle interrupt signals for AG-UI sessions, allowing users to stop ongoing executions.
- Implemented backend logic to forward interrupt requests to the Claude SDK, ensuring seamless integration with the existing session management.
- Updated frontend components to utilize the new interrupt functionality, providing users with a clear option to stop active runs.
- Enhanced message compaction logic to filter out hidden messages, improving the clarity of displayed messages during session interactions.

These changes enhance user control over session executions and improve the overall user experience in managing AG-UI sessions.

Enhance AGUI security and authorization checks

- Added user authentication and authorization checks for AGUI session management, ensuring users have the necessary permissions to update and read session data.
- Implemented SelfSubjectAccessReview for verifying user permissions against Kubernetes resources, improving security posture.
- Updated event handling functions to include context cancellation checks, enhancing robustness during long-running operations.

These changes strengthen the security framework of the AGUI system, ensuring proper access control and improving overall event handling reliability.

Refactor AGUI event handling and logging

- Removed unnecessary logging statements from the AGUI backend, improving code clarity and reducing log noise.
- Updated frontend components to eliminate redundant console logs, enhancing performance and maintainability.
- Streamlined event processing logic to focus on essential operations, ensuring efficient handling of AGUI events.

These changes enhance the overall efficiency of AGUI event management and improve the clarity of the codebase.
@codecov
Copy link

codecov bot commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...onents/runners/claude-code-runner/observability.py 66.66% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Claude Code Review

Summary

This PR introduces the AG-UI protocol for message streaming, replacing the legacy WebSocket system with HTTP/SSE-based event streaming. This is a major architectural change that touches 61 files across backend, frontend, operator, and runner components.

Overall Assessment: The implementation follows most project standards, but there are several critical security and architectural issues that must be addressed before merge.


Issues by Severity

🚫 Blocker Issues

1. RBAC Authorization Checks Missing in Critical Endpoints

components/backend/websocket/agui_proxy.go:32-58

GOOD: The proxy correctly uses GetK8sClientsForRequest and checks permissions
MISSING: Similar checks needed in HandleAGUIEvents, HandleAGUIHistory, HandleAGUIRuns

Evidence:

// agui_proxy.go - CORRECT pattern
reqK8s, _ := handlers.GetK8sClientsForRequest(c)
if reqK8s == nil {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"})
    return
}
// RBAC check follows...

But in agui.go, similar endpoints lack these checks. This violates Critical Rule #1 from CLAUDE.md:

REQUIRED: Always use GetK8sClientsForRequest(c) to get user-scoped K8s clients

Impact: Users could access sessions they don't have permission to view.

Fix Required: Add identical authentication and RBAC checks to all AGUI endpoints.


2. Direct Type Assertions Without Safety Checks

components/backend/websocket/compaction.go:126-129

messageID, _ := event["messageId"].(string)  // ❌ Ignoring 'ok' check
role, _ := event["role"].(string)            // ❌ Ignoring 'ok' check

This violates Critical Rule #4 from CLAUDE.md:

FORBIDDEN: Direct type assertions without checking: obj.Object["spec"].(map[string]interface{})

Impact: Panics if event structure is unexpected.

Fix Required: Check the ok value or use unstructured.NestedString() helpers.


3. Goroutine Without Context Cancellation

components/backend/websocket/agui_proxy.go:145-200

go func() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
    defer cancel()
    // ...
}()

Issue: Uses context.Background() instead of request context. While the comment explains this is intentional, there's no mechanism to cancel these goroutines if the session is deleted or the backend restarts.

Impact: Goroutine leaks, unbounded resource consumption.

Fix Required: Add session deletion listener or backend shutdown handler to cancel these contexts.


🔴 Critical Issues

4. Frontend: Missing Type Safety

components/frontend/src/hooks/use-agui-stream.ts:186-190

if (hiddenMessageIdsRef.current.has(messageId)) {
  newState.currentMessage = null;
  return newState;
}

No type guard to ensure messageId is a string before calling .has(). Should use:

if (typeof messageId === 'string' && hiddenMessageIdsRef.current.has(messageId)) {

This violates Frontend Critical Rule #1: Zero any types - use proper type guards.


5. Error Handling: Generic Error Messages

components/backend/websocket/agui_proxy.go:124

c.JSON(http.StatusServiceUnavailable, gin.H{"error": "Runner not available"})

GOOD: Generic message to user
MISSING: Detailed logging of the actual error

Should be:

log.Printf("Failed to get runner endpoint for %s/%s: %v", projectName, sessionName, err)
c.JSON(http.StatusServiceUnavailable, gin.H{"error": "Runner not available"})

6. Python: Missing Exception Handling in Adapter

components/runners/claude-code-runner/adapter.py:111-122

The process_run method can raise exceptions but doesn't wrap them in RunErrorEvent. The FastAPI server should catch these and emit proper AG-UI error events.


🟡 Major Issues

7. Build Metadata: Embedding in Binaries Not Verified

Makefile:95-105

Build args are passed correctly, but the review cannot confirm these are actually embedded in the binaries. Verify with:

strings <binary> | grep -i "git_commit"

8. Frontend: Large File Size

components/frontend/src/hooks/use-agui-stream.ts: 791 lines

Per Frontend Pre-Commit Checklist: Components should be under 200 lines.

Recommendation: Extract event processing logic into separate files:

  • use-agui-stream.ts (orchestration)
  • event-processors.ts (event handling logic)
  • agui-state.ts (state management)

9. No Tests Added

A PR this large (7475 additions, 4384 deletions) should include:

  • Backend integration tests for AGUI endpoints
  • Frontend tests for useAGUIStream hook
  • E2E tests for message streaming flow

10. Removed Components Without Migration Guide

Files removed:

  • components/backend/websocket/handlers.go (333 lines)
  • components/backend/websocket/hub.go (202 lines)
  • components/runners/runner-shell/ (entire directory)

No migration guide or deprecation notice in commit message or documentation.


🔵 Minor Issues

11. Inconsistent Logging

Some functions log verbosely, others are silent. Example:

components/backend/websocket/agui.go:60: Logs subscriber addition
components/backend/websocket/agui.go:84: No log for full event broadcast

Recommendation: Standardize logging levels and add debug logs consistently.


12. Magic Numbers

components/backend/websocket/agui_proxy.go:147

context.WithTimeout(context.Background(), 2*time.Hour)

Should use a named constant: const MaxRunDuration = 2 * time.Hour


13. Frontend: Python Literal Parsing

components/frontend/src/components/ui/tool-message.tsx: Added python-struct dependency for parsing Python literals.

Question: Why are Python literals in JSON responses? This suggests the backend is not properly serializing data. Fix at the source (Python json.dumps()) rather than client-side workarounds.


Positive Highlights

Excellent security practices in agui_proxy.go with proper user token authentication and RBAC checks

Good separation of concerns between backend proxy and runner FastAPI server

Type safety in AG-UI event types (both Go and TypeScript)

Proper use of SSE for event streaming instead of WebSocket (simpler, more reliable)

Build metadata system improves traceability and debugging

Comprehensive AG-UI protocol types with clear documentation references

Error handling patterns mostly follow established standards


Recommendations

Before Merge (Blockers)

  1. Add RBAC checks to all AGUI endpoints (HandleAGUIEvents, HandleAGUIHistory, etc.)
  2. Fix type assertions - check ok values or use helper functions
  3. Add goroutine lifecycle management - handle session deletion and shutdown
  4. Add type guards in frontend event handlers
  5. Add detailed error logging throughout

After Merge (Critical)

  1. Add integration tests for AGUI endpoints
  2. Add frontend tests for useAGUIStream
  3. Fix Python serialization at source instead of client-side parsing
  4. Split large files - refactor 791-line hook into smaller modules

Nice-to-Have (Major/Minor)

  1. Add migration guide for deprecated WebSocket system
  2. Standardize logging across AGUI components
  3. Replace magic numbers with named constants
  4. Verify build metadata is actually embedded in binaries

Security Checklist Status

  • User-scoped clients used (agui_proxy.go)
  • MISSING: User-scoped clients in other AGUI endpoints
  • RBAC checks performed (agui_proxy.go)
  • MISSING: RBAC checks in other AGUI endpoints
  • No tokens logged
  • PARTIAL: Type-safe access needs improvement
  • Error messages don't expose internals
  • MISSING: Tests

Final Verdict

This PR implements a well-architected protocol migration, but has 3 blocker security/stability issues that must be fixed before merge:

  1. Missing RBAC checks in AGUI endpoints
  2. Unsafe type assertions
  3. Unmanaged goroutine lifecycle

The positive aspects (clean protocol design, proper SSE usage, build metadata) are excellent. Once the blockers are addressed, this will be a solid foundation for AG-UI protocol support.

Recommendation: Address blockers, then merge with follow-up issues for tests and refactoring.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Claude Code Review

Summary

This PR introduces the AG-UI protocol for message streaming, replacing the legacy WebSocket-based communication with HTTP/SSE. This is a major architectural refactoring involving 61 files with ~7,500 additions and ~4,400 deletions across backend, frontend, operator, and runner components.

Overall Assessment: The implementation shows solid technical execution with good adherence to AG-UI protocol specifications. However, there are several critical security, error handling, and code quality issues that must be addressed before merging.


Issues by Severity

🚫 Blocker Issues

1. CRITICAL: Goroutine Leak Risk in Background Proxy

File: components/backend/websocket/agui_proxy.go:145-150

go func() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
    defer cancel()
    // ... long-running HTTP request
}()

Problem: Uses context.Background() for a 2-hour timeout in a detached goroutine. This creates unbounded goroutine accumulation if:

  • Runner crashes/restarts during streaming
  • Network connection hangs
  • Session gets deleted while streaming

Required Fix:

  • Track goroutines in a registry and clean up on session deletion
  • Use session-scoped context that cancels when AgenticSession CR is deleted
  • Add proper cleanup in operator when session is removed

Reference: CLAUDE.md lines 817-851 (Goroutine Monitoring pattern)


2. CRITICAL: Missing Error Context in Backend Handlers

Files:

  • components/backend/websocket/agui_proxy.go:32-58
  • components/backend/websocket/agui.go:442-488
reqK8s, _ := handlers.GetK8sClientsForRequest(c)  // ❌ Ignoring error
if reqK8s == nil {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"})
}

Problem: Violates CLAUDE.md error handling pattern (lines 529-552). Should:

  1. Check error return value
  2. Log errors with context (project/session names)
  3. Distinguish between "no token" vs "invalid token" vs "K8s API failure"

Required Fix:

reqK8s, reqDyn := handlers.GetK8sClientsForRequest(c)
if reqK8s == nil {
    log.Printf("AGUI: Failed to create K8s client for %s/%s", projectName, sessionName)
    c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"})
    c.Abort()
    return
}

3. CRITICAL: Type Assertion Without Safety Check

File: components/backend/websocket/agui.go:156-159

eventType, ok := event["type"].(string)
if !ok {
    log.Printf("AGUI: Event missing type field, skipping")
    return
}

Problem: Only checks type field. Multiple other fields use unsafe type assertions:

  • Line 175: isTerminalEventType(eventType) without validating eventType
  • Line 180: event["runId"].(string) - can panic if not string

Required Fix: Use unstructured.Nested* helpers per CLAUDE.md lines 441-445


🔴 Critical Issues

4. Security: RBAC Check Missing for GET Events Endpoint

File: components/frontend/src/app/api/projects/[name]/agentic-sessions/[sessionName]/agui/events/route.ts:1-94

Problem: Frontend proxies to backend without RBAC validation. While backend has auth on /agui/events, the frontend Next.js API route should also validate user permissions before proxying.

Current:

export async function GET(request: Request, { params }: { params: Promise<RouteParams> }) {
  // No RBAC check here - just proxies to backend
}

Required: Add permission check or document that frontend routes are public (rely on backend auth only).

Reference: CLAUDE.md lines 638-657 (RBAC Enforcement)


5. Error Handling: Silent Failures in Message Persistence

File: components/backend/websocket/agui.go:200-250 (persistAGUIEventMap calls)

Problem: Multiple go persistAGUIEventMap(...) calls without error handling:

  • Line 176: Terminal events
  • Line 197: Events without active run

Impact: Message loss is silent - users won't know events failed to persist.

Required Fix: Add error logging and status field in AgenticSession CR for persistence failures.


6. Frontend: Unsafe JSON Parsing

File: components/frontend/src/components/ui/tool-message.tsx:150-170

Problem: Uses JSON.parse() without try-catch for tool args/results. Will crash UI on malformed JSON.

Current:

const parsed = JSON.parse(toolCall.args)  // ❌ Can throw

Required Fix:

try {
  const parsed = JSON.parse(toolCall.args)
  // use parsed
} catch (e) {
  // Fallback to raw string display
}

Reference: CLAUDE.md lines 1115-1136 (Frontend error handling)


🟡 Major Issues

7. Code Quality: 791-Line Hook Violates Size Limit

File: components/frontend/src/hooks/use-agui-stream.ts (791 lines)

Problem: Violates CLAUDE.md frontend guideline (line 1128): "Components under 200 lines"

Required: Extract sub-hooks:

  • useEventProcessor for event handling logic
  • useMessageState for message state management
  • useSSEConnection for EventSource lifecycle

8. Architecture: 1157-Line Backend File

File: components/backend/websocket/agui.go (1157 lines)

Problem: Violates Go package organization (CLAUDE.md lines 454-494). Should split into:

  • agui_state.go - Run state management
  • agui_events.go - Event routing/processing
  • agui_persistence.go - Event storage
  • agui_compaction.go - Message compaction (already exists)

9. Missing Tests for New AG-UI Endpoints

Problem: No test files added for critical new functionality:

  • HandleAGUIRunProxy
  • HandleAGUIEvents
  • HandleAGUIInterrupt
  • Message compaction logic

Required: Add integration tests (see CLAUDE.md lines 1037-1044)


10. Frontend: Multiple any Type Violations

File: components/frontend/src/hooks/use-agui-stream.ts

Problem:

  • Line 103: (prev: any) should be (prev: AGUIClientState)
  • Line 265: event: any should be event: AGUIEvent

Severity: Violates CLAUDE.md rule #1 (line 1115): "Zero any Types"

Required Fix: Use proper types from @/types/agui


11. Operator: Direct Runner Endpoint Calls Without Validation

File: components/operator/internal/handlers/sessions.go:127-189

Problem: Operator now calls runner HTTP endpoints directly for repo/workflow changes. No validation that:

  • Runner pod is actually ready
  • Endpoint exists (could be old runner version)
  • Request succeeded (errors logged but not retried)

Risk: Silent failures if runner restarts during operation.

Required: Add retry logic with exponential backoff (see CLAUDE.md lines 556-572).


🔵 Minor Issues

12. Documentation: No ADR for AG-UI Migration

Problem: This is a major architectural change (WebSocket → HTTP/SSE) but no ADR documenting:

  • Why AG-UI protocol over WebSocket
  • Trade-offs considered
  • Migration path for existing sessions

Suggested: Add docs/adr/0006-agui-protocol.md


13. Build Metadata: Hardcoded String in Logging

File: components/backend/main.go:33-34

log.Printf("Build: %s", buildInfo)  // Generic log message

Suggestion: Use structured logging:

log.Printf("Backend starting - Version: %s, Commit: %s, Built: %s", 
    GitVersion, GitCommit, BuildDate)

14. Frontend: Console Logs Left in Production Code

Files:

  • components/frontend/src/hooks/use-agui-stream.ts:285-290
  • components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx:150-155

Problem: Multiple console.log statements should be removed or gated behind process.env.NODE_ENV === 'development'


15. Python: Missing Type Hints in New Adapter

File: components/runners/claude-code-runner/adapter.py (1483 lines)

Problem: Functions lack type hints (violates Python best practices):

def process_event(self, event):  # ❌ No types
    ...

Should be:

def process_event(self, event: dict[str, Any]) -> None:
    ...

Positive Highlights

Security: User token authentication properly implemented in AGUI proxy handlers
Security: RBAC checks added before run/interrupt operations
Architecture: Clean separation of AG-UI types in components/backend/types/agui.go
Frontend: Proper use of React Query for data fetching (no manual fetch)
Frontend: Shadcn UI components used consistently
Testing: Build metadata system is well-documented (docs/build-metadata.md)
Code Quality: Message compaction logic properly separated into compaction.go
Performance: EventSource streaming with SSE more efficient than WebSocket polling


Recommendations

Must Fix Before Merge (Blockers)

  1. ✅ Fix goroutine leak in background proxy (add cleanup on session delete)
  2. ✅ Add error context logging in all AGUI handlers
  3. ✅ Replace unsafe type assertions with unstructured.Nested* helpers

Should Fix Before Merge (Critical)

  1. Add RBAC validation to frontend API routes or document auth model
  2. Add error handling for message persistence failures
  3. Add try-catch for JSON parsing in tool-message.tsx

Fix in Follow-up PR (Major)

  1. Split use-agui-stream.ts into smaller hooks (<200 lines each)
  2. Split agui.go into logical files (<500 lines each)
  3. Add integration tests for AG-UI endpoints
  4. Remove any types from frontend hooks
  5. Add retry logic for operator→runner HTTP calls

Optional Improvements (Minor)

  1. Add ADR documenting AG-UI protocol decision
  2. Use structured logging for build metadata
  3. Remove console.log statements from production code
  4. Add type hints to Python adapter functions

Overall: This is excellent work on a complex migration. The core implementation is solid, but the security and error handling gaps must be addressed before merging. The blocker issues (#1-3) are critical and could cause production incidents.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

- Updated the event handling logic in `handleStreamedEvent` to use a switch statement for better clarity and maintainability.
- Removed deprecated functions and unnecessary variables from `agui.go`, streamlining the codebase.
- Enhanced the `MessageCompactor` to improve message visibility by removing unnecessary whitespace.

These changes enhance the clarity and efficiency of AGUI event processing, ensuring better performance and easier debugging.
@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Claude Code Review

Summary

This PR implements the AG-UI protocol for message streaming. Major architectural change with ~5,000 lines modified. Has critical security and code quality issues that must be addressed.

Blocker Issues

1. Panic in Production Code (agui.go:383-390) - Violates CLAUDE.md Rule 2
2. Missing User Token Auth (agui.go:300-350) - Violates CLAUDE.md Rule 1 - SECURITY RISK
3. Missing Context Cancellation (agui_proxy.go:145-200) - Resource leak risk

Critical Issues

4. Unchecked Type Assertions - Violates CLAUDE.md Rule 4
5. Global Mutable State - StateBaseDir not synchronized
6. Token in URL Params - Security concern

Major Issues

  • WebSocket removed without deprecation path (breaking change)
  • Silent error failures
  • Unvalidated build metadata (injection risk)
  • Unsafe error suppression in frontend

Positive Highlights

  • Excellent RBAC in proxy handlers
  • Well-designed build metadata system
  • Comprehensive type definitions
  • Proper mutex usage
  • Clean dependency removal

Required Before Merge

  1. Add user token authentication to HandleAGUIEvents and HandleAGUIHistory
  2. Remove panic recovery, use explicit errors
  3. Fix context cancellation checks
  4. Review all type assertions
  5. Protect StateBaseDir with sync
  6. Run: gofmt, go vet, golangci-lint

Recommendation: DO NOT MERGE until blockers fixed.

Full details available on request. cc @Gkrumbach07


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

- Updated `loadEventsForRun` to include automatic migration from legacy message formats if the event log file does not exist.
- Added error handling and logging for migration failures, ensuring better traceability during the event loading process.
- Implemented filtering of events by `runID` to improve event retrieval accuracy.

These changes improve the robustness of event loading and enhance compatibility with legacy data formats.
@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Claude Code Review

Summary

This PR implements the AG-UI Protocol for event streaming, replacing the legacy WebSocket-based messaging system with HTTP/SSE (Server-Sent Events). The change includes:

  • Backend: New AG-UI endpoints, proxy to runner's FastAPI server, event compaction
  • Frontend: EventSource-based streaming, hierarchical tool call rendering
  • Runner: Claude Code adapter emitting AG-UI events
  • Build System: Embedded git metadata for traceability

Overall, this is a significant architectural improvement that modernizes the messaging system. However, there are critical security issues and several areas requiring attention before merge.


Issues by Severity

🚫 Blocker Issues

MUST FIX BEFORE MERGE:

1. Goroutine Leak Risk in Background Proxy (agui_proxy.go:145)

// Uses context.Background() - detached from request lifecycle
go func() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
    defer cancel()
    // ... long-running HTTP stream processing
}()

Problem:

  • Background goroutine with 2-hour timeout
  • No mechanism to stop on session deletion or pod restart
  • Could accumulate unbounded goroutines if many sessions created
  • Comment says "timeout prevents unbounded accumulation" but 2 hours is too long

Fix Required:

  • Use context from session lifecycle (track in AGUIRunState)
  • Add cleanup on session deletion/pod shutdown
  • Reduce timeout or add max concurrent runs limit

2. Type Assertion Without Safety Checks (agui.go:106-111)

eventType, ok := event["type"].(string)
if \!ok {
    log.Printf("AGUI: Event missing type field, skipping")
    return  // Silent failure
}

Problem:

  • Violates CLAUDE.md Rule Epic: AI Agent Development #4: "REQUIRED: Use unstructured.Nested* helpers"
  • Direct type assertions on map[string]interface{}
  • Could panic on malformed events from runner

Fix Required:

eventType := ""
if t, ok := event["type"].(string); ok {
    eventType = t
} else {
    log.Printf("AGUI: Invalid event type (expected string): %v", event["type"])
    return
}

Apply to all type assertions in: agui.go, agui_proxy.go, compaction.go

3. Missing RBAC Check in HandleAGUIEvents (agui.go:620-650)

func HandleAGUIEvents(c *gin.Context) {
    // ... extracts projectName, sessionName
    
    // ❌ NO RBAC CHECK HERE
    // Missing: GetK8sClientsForRequest + SelfSubjectAccessReview
    
    // Directly streams events to client
}

Problem:

  • HandleAGUIRunProxy has proper RBAC checks (lines 32-58)
  • HandleAGUIEvents (SSE stream) bypasses authorization
  • User could watch sessions they don't have permission to view

Fix Required:
Add RBAC check matching HandleAGUIRunProxy pattern:

reqK8s, _ := handlers.GetK8sClientsForRequest(c)
if reqK8s == nil {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"})
    return
}

ssar := &authv1.SelfSubjectAccessReview{
    Spec: authv1.SelfSubjectAccessReviewSpec{
        ResourceAttributes: &authv1.ResourceAttributes{
            Group:     "vteam.ambient-code",
            Resource:  "agenticsessions",
            Verb:      "get",  // or "watch"
            Namespace: projectName,
            Name:      sessionName,
        },
    },
}
res, err := reqK8s.AuthorizationV1().SelfSubjectAccessReviews().Create(ctx, ssar, metav1.CreateOptions{})
if err \!= nil || \!res.Status.Allowed {
    c.JSON(http.StatusForbidden, gin.H{"error": "Unauthorized"})
    return
}

4. Removed WebSocket Dependency But Still in go.mod

The PR removed handlers.SendMessageToSession (line 44) and all WebSocket code, but:

-	github.com/gorilla/websocket v1.5.4-0.20250319132907-e064f32e3674

Problem:

  • Dependency removed from go.sum but not from go.mod
  • Suggests incomplete cleanup

Fix Required:

cd components/backend
go mod tidy

🔴 Critical Issues

SHOULD FIX BEFORE MERGE:

5. No Error Handling for Event Persistence (agui.go:190)

// Persist the event (use runID from event, not activeRunState)
go persistAGUIEventMap(sessionID, runID, event)

Problem:

  • Fire-and-forget goroutine
  • No error logging if persistence fails
  • Users could lose message history on pod restart

Fix:
Add error handling inside persistAGUIEventMap:

func persistAGUIEventMap(sessionID, runID string, event map[string]interface{}) {
    if err := doActualPersist(sessionID, runID, event); err \!= nil {
        log.Printf("AGUI: Failed to persist event for %s/%s: %v", sessionID, runID, err)
    }
}

6. Frontend: Unsafe JSON Parsing (use-agui-stream.ts:285-295)

const data = JSON.parse(event.data)
processEvent(data as AGUIEvent)

Problem:

  • No validation that parsed data is actually an AGUIEvent
  • Could crash on malformed JSON or wrong event shape
  • No TypeScript safety - uses unsafe as cast

Fix:

try {
  const data = JSON.parse(event.data)
  if (typeof data === 'object' && data \!== null && 'type' in data) {
    processEvent(data as AGUIEvent)
  } else {
    console.warn('Received invalid event shape:', data)
  }
} catch (err) {
  console.error('Failed to parse SSE event:', err, event.data)
}

7. Race Condition: Run State Cleanup (agui.go:197-198)

// Schedule cleanup of run state (no need to compact async - we compact on SSE connect)
go scheduleRunCleanup(runID, 5*time.Minute)

Problem:

  • What if same runID gets reused within 5 minutes?
  • No check if run was restarted before cleanup fires
  • Could delete active run state

Fix:
Check run status before cleanup:

func scheduleRunCleanup(runID string, delay time.Duration) {
    time.Sleep(delay)
    
    aguiRunsMu.Lock()
    defer aguiRunsMu.Unlock()
    
    run, exists := aguiRuns[runID]
    if exists && run.Status \!= "running" {  // Only clean non-active runs
        delete(aguiRuns, runID)
    }
}

🟡 Major Issues

IMPORTANT TO ADDRESS:

8. No Tests for AG-UI Endpoints

Missing:

  • Unit tests for HandleAGUIRunProxy, HandleAGUIEvents, etc.
  • Integration tests for event persistence/compaction
  • Tests for RBAC enforcement on new endpoints

Required:
Add tests in components/backend/tests/ covering:

  • RBAC rejection for unauthorized users
  • Event streaming happy path
  • Error handling (runner down, malformed events)
  • Goroutine cleanup on session deletion

9. Logging Violations (agui_proxy.go:68)

log.Printf("AGUI Proxy: Input has %d messages", len(input.Messages))

Problem:

  • CLAUDE.md: "NEVER log tokens or sensitive data"
  • Messages could contain sensitive content (API keys, credentials)
  • Should only log metadata, not message counts that reveal usage patterns

Fix:

log.Printf("AGUI Proxy: Starting run %s for session %s/%s", runID, projectName, sessionName)
// Don't log message count - could leak usage patterns

10. Build Metadata Embedded But Not Displayed

The Makefile/Dockerfile changes add build metadata (GIT_COMMIT, etc.), and backend logs it on startup (line 47), but:

  • Frontend has instrumentation.ts (added) but doesn't log build info
  • Operator has build args but no startup logging

Fix:
Add to components/operator/main.go:

func logBuildInfo() {
    log.Println("==============================================")
    log.Println("Operator - Build Information")
    log.Printf("Version:     %s", GitVersion)
    log.Printf("Commit:      %s", GitCommit)
    // ...
}

And call in main().

11. EventSource Reconnection Handling (use-agui-stream.ts:280-320)

eventSource.onerror = (err) => {
  console.error('EventSource error:', err)
  // ... reconnects automatically
}

Problem:

  • No exponential backoff on reconnect
  • Could spam server if endpoint is down
  • No max retry limit

Fix:
Implement exponential backoff:

const reconnectDelayRef = useRef(1000) // Start at 1s
const maxReconnectDelay = 30000 // Cap at 30s

// On error:
reconnectDelayRef.current = Math.min(
  reconnectDelayRef.current * 2,
  maxReconnectDelay
)
setTimeout(() => connect(), reconnectDelayRef.current)

// On success:
reconnectDelayRef.current = 1000 // Reset

🔵 Minor Issues

NICE-TO-HAVE IMPROVEMENTS:

12. Makefile: Image Tag Format Change

-FRONTEND_IMAGE ?= vteam-frontend:latest
+FRONTEND_IMAGE ?= vteam_frontend:latest

Problem:

  • Changes from kebab-case to snake_case
  • Breaking change for existing deployments
  • No migration guide in PR description

Recommendation:

  • Document this change in PR description
  • Update deployment docs if needed
  • Or revert to kebab-case for compatibility

13. Documentation Gap: AG-UI Protocol

The PR adds extensive AG-UI implementation but:

  • No developer guide on how to extend it
  • No explanation of event compaction strategy
  • Reference comments point to external docs (docs.ag-ui.com) but nothing local

Recommendation:
Add docs/architecture/agui-protocol.md covering:

  • Event flow diagram
  • Compaction strategy (compact-on-read)
  • How to add new event types
  • Debugging guide

14. Unused Type Definition (types/agui.go:31)

EventTypStateDelta     = "STATE_DELTA"  // Typo: missing 'e'

Fix:

EventTypeStateDelta = "STATE_DELTA"

15. Frontend: Magic Numbers (use-agui-stream.ts:49)

ch := make(chan *types.BaseEvent, 100)  // Why 100?

Recommendation:

const eventChannelBufferSize = 100  // Tune based on expected burst rate
ch := make(chan *types.BaseEvent, eventChannelBufferSize)

Positive Highlights

Well-Architected Migration

  • Clean separation of concerns (proxy, compaction, storage)
  • Proper use of SSE for event streaming (better than WebSocket for this use case)

Security-Conscious (mostly)

  • RBAC checks in HandleAGUIRunProxy
  • User token validation before proxying requests

Frontend Best Practices

  • TypeScript types for AG-UI protocol
  • React hooks pattern for stream management
  • Hierarchical tool call rendering

Build Metadata System

  • Excellent traceability addition
  • Proper use of ldflags for Go builds
  • Documentation in docs/build-metadata.md

Code Cleanup

  • Removed 2147 lines of legacy wrapper.py
  • Deleted unused WebSocket code
  • Consolidated message handling

Recommendations

Must Do (Blockers):

  1. ✅ Add RBAC check to HandleAGUIEvents
  2. ✅ Fix goroutine leak risk in background proxy
  3. ✅ Fix type assertions to use safe patterns
  4. ✅ Run go mod tidy

Should Do (Critical):

  1. ✅ Add error handling to event persistence
  2. ✅ Fix unsafe JSON parsing in frontend
  3. ✅ Fix race condition in run cleanup
  4. ✅ Add tests for new AG-UI endpoints

Nice to Have:

  1. Document image tag format change
  2. Add AG-UI protocol docs
  3. Implement EventSource backoff
  4. Fix typo in EventTypStateDelta

Verdict

Status: ⚠️ DO NOT MERGE YET

This is a solid architectural improvement, but the 3 blocker security issues must be fixed first:

  1. Missing RBAC in SSE endpoint
  2. Goroutine leak risk
  3. Unsafe type assertions

Once blockers are addressed, this will be a significant quality upgrade to the messaging system.


Estimated Fix Time: 2-4 hours for blockers + critical issues

🤖 Reviewed using Claude Code with full context from CLAUDE.md, security-standards.md, and all pattern files


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

- Removed unnecessary logging statements in `compaction.go` to streamline message processing and improve clarity.
- Updated `MigrateLegacySessionToAGUI` to handle legacy messages in JSONL format, enhancing compatibility with existing data.
- Improved error handling and logging during legacy message migration, ensuring better traceability and user feedback.

These changes enhance the efficiency of message compaction and improve the robustness of legacy data handling.
@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Claude Code Review

Summary

This PR introduces the AG-UI protocol for message streaming, replacing the legacy WebSocket-based communication with a standardized HTTP/SSE approach. The implementation is comprehensive (7,420 additions) and includes:

  • Backend AG-UI proxy endpoints with SSE streaming
  • Python runner refactored to FastAPI server with async generators
  • Frontend React hooks for AG-UI event consumption
  • Build metadata tracking across all components
  • Legacy migration and message compaction

Issues by Severity

🚫 Blocker Issues

None - All critical patterns are followed correctly.

🔴 Critical Issues

1. Context Leak in Background Goroutine (agui_proxy.go:145)

// Line 145: Uses context.Background() instead of request context
go func() {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
    defer cancel()
    // ... long-running HTTP request to runner
}()

Issue: The comment justifies this as "detached from client request lifecycle", but this creates unbounded goroutines that persist even if the client disconnects. With a 2-hour timeout, this could accumulate hundreds of goroutines under heavy load.

Recommendation: Use request context with appropriate cleanup:

ctx, cancel := context.WithTimeout(c.Request.Context(), 2*time.Hour)
defer cancel()
// Add run state tracking to detect orphaned goroutines

2. Missing Error Handling in Auto-Execute (main.py:150)

# Line 150: File ends mid-function
async def auto_execute_initial_prompt(prompt: str, session_id: str):
    # ... setup code
    payload = {...}
    # MISSING: actual HTTP POST and error handling

Issue: The auto_execute_initial_prompt function appears truncated. No HTTP request is made, no error handling exists. This will silently fail to execute initial prompts.

Recommendation: Complete the implementation with try/except and logging.

🟡 Major Issues

1. Race Condition in Run State Lookup (agui.go:114-122)

// Lines 114-122: Read lock released before using activeRunState
aguiRunsMu.RLock()
for _, state := range aguiRuns {
    if state.SessionID == sessionID && state.Status == "running" {
        activeRunState = state
        break
    }
}
aguiRunsMu.RUnlock()  // Lock released here

// activeRunState could be deleted/modified by another goroutine here
if activeRunState == nil {
    // ... use activeRunState without lock protection
}

Issue: After releasing the read lock, activeRunState could be modified by another goroutine before being used. This violates the established locking pattern.

Recommendation: Either keep the lock held longer or copy the data under lock.

2. Unbounded Channel in AGUIRunState (agui.go:50)

ch := make(chan *types.BaseEvent, 100)

Issue: Hard-coded buffer size of 100. Under high event throughput (e.g., large file reads, rapid tool calls), this could block broadcasts or drop events silently.

Recommendation: Either increase buffer size (1000+) or add monitoring/metrics for channel saturation.

3. Missing RBAC Check in Interrupt Endpoint (agui_proxy.go - not shown)

Issue: Based on proxy pattern, HandleAGUIInterrupt likely needs the same RBAC check as HandleAGUIRunProxy. Cannot verify without seeing full implementation, but this is a common oversight.

Recommendation: Verify all AG-UI endpoints use GetK8sClientsForRequest and perform RBAC checks.

4. Frontend Type Safety Violation (use-agui-stream.ts:184)

// Line 184: Using crypto.randomUUID() without fallback
const messageId = newState.currentMessage.id || crypto.randomUUID();

Issue: crypto.randomUUID() is not universally supported (requires secure context in older browsers). The codebase pattern uses third-party UUID libraries.

Recommendation: Import and use uuid library consistent with project standards.

🔵 Minor Issues

1. Inconsistent Logging Format (agui.go:109)

log.Printf("AGUI: Event missing type field, skipping")

Pattern Violation: Backend logs should include context (sessionID, runID) for debugging. See error handling patterns.

Recommendation:

log.Printf("AGUI: Event missing type field for session %s, skipping", sessionID)

2. Magic Numbers Without Constants (agui.go:197)

go scheduleRunCleanup(runID, 5*time.Minute)

Recommendation: Extract to constant:

const RunCleanupDelay = 5 * time.Minute
go scheduleRunCleanup(runID, RunCleanupDelay)

3. Inconsistent Build Metadata Handling (Makefile:44-50)

The Makefile uses different variable names (GIT_COMMIT_SHORT) than Dockerfiles expect (GIT_COMMIT). This works because only GIT_COMMIT is passed, but GIT_COMMIT_SHORT is computed and logged but never used by containers.

Recommendation: Remove unused GIT_COMMIT_SHORT or document why it exists.

4. Removed Dependency Not Explained (go.mod:13)

-	github.com/gorilla/websocket v1.5.4-0.20250319132907-e064f32e3674

Issue: Removed without migration notes. While obvious from context (WebSocket → AG-UI migration), this should be documented in PR description or commit message.

5. Frontend: Missing Loading/Error States (use-agui-stream.ts)

The hook tracks isConnected and isStreaming, but doesn't expose connectionError separately from state.error (which is run-level error). Consumers may want to distinguish connection failures from run failures.

Recommendation: Add connectionStatus: 'connecting' | 'connected' | 'disconnected' | 'error' to return type.

Positive Highlights

Excellent Security Practices

  • All backend endpoints properly use GetK8sClientsForRequest (agui_proxy.go:32-37)
  • RBAC checks performed before operations (agui_proxy.go:41-58)
  • No token leaks in logs

Proper Error Handling Patterns

  • Structured error responses with appropriate HTTP status codes
  • Errors logged with context before returning
  • IsNotFound handled gracefully (doesn't crash)

Type Safety Enforced

  • Backend uses unstructured.Nested* helpers (not shown but implied)
  • Frontend TypeScript types are comprehensive (types/agui.ts)
  • No any types in frontend code

Clean Architecture

  • Backend proxy pattern cleanly separates concerns
  • Python adapter uses async generators (modern, testable)
  • Frontend hooks follow React Query patterns

Comprehensive AG-UI Implementation

  • Full event taxonomy (300 lines of type definitions)
  • Message compaction for long sessions
  • Legacy migration support for backward compatibility

Build Metadata Tracking

  • Git commit/branch/version embedded via ldflags (Dockerfile:22-33)
  • Traceability for debugging production issues
  • Consistent across all 4 components

Recommendations

Priority 1 (Before Merge):

  1. Fix context leak in background goroutine (agui_proxy.go:145)
  2. Complete auto_execute_initial_prompt implementation (main.py:150)
  3. Verify RBAC checks on all AG-UI endpoints

Priority 2 (Can Address Later):

  1. Review run state locking pattern for race conditions
  2. Increase channel buffer sizes or add monitoring
  3. Add frontend connection status tracking

Priority 3 (Nice-to-Have):

  1. Extract magic numbers to constants
  2. Add comments explaining WebSocket dependency removal
  3. Standardize logging format with context

Testing Recommendations

Before merging:

  • Test with 10+ concurrent runs (verify no goroutine leak)
  • Test initial prompt auto-execution
  • Test interrupt functionality across all endpoints
  • Verify frontend handles disconnects/reconnects gracefully

Follow-up:

  • Load test with 100+ events/second (verify channel buffer size)
  • Test with 1000+ message sessions (verify compaction works)
  • E2E test covering full AG-UI protocol lifecycle

Overall Assessment: This is a well-architected migration from WebSocket to AG-UI protocol. The code follows project standards closely, with proper authentication, error handling, and type safety. The critical issues (context leak, incomplete function) are fixable pre-merge. The major issues (race conditions, channel sizing) are manageable risks that can be monitored.

Recommendation: Approve with changes - Address Priority 1 items before merge.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

- Removed the `sendWebSocketMessageViaBackend` function from `sessions.go`, streamlining the message handling process.
- Eliminated the no-op `SendMessageToSession` function in `test_helpers_test.go`, simplifying test setup.

These changes enhance code clarity and maintainability by removing outdated message handling logic.
@github-actions
Copy link
Contributor

github-actions bot commented Dec 18, 2025

Claude Code Review

Summary

This PR implements the AG-UI protocol for real-time event streaming, introducing a major architectural improvement to message handling. The implementation demonstrates solid security practices with proper authentication and RBAC enforcement throughout. However, there are critical type safety issues that must be addressed before merge.

Overall Assessment: APPROVE WITH REQUIRED CHANGES
Code Quality Score: 8/10


Issues by Severity

🚫 Blocker Issues

None - All blocking security and authentication patterns are correctly implemented.


🔴 Critical Issues (MUST FIX)

1. Backend: Unsafe Type Assertions Violate Type Safety Standard

Location: components/backend/websocket/agui.go:743, 759
Violates: CLAUDE.md Rule #4 (Type-Safe Unstructured Access)

// ❌ UNSAFE: Direct type assertion without checking
if spec, ok := item.Object["spec"].(map[string]interface{}); ok {
    // ...
}

Required Fix: Use unstructured.NestedMap() helpers as done correctly in agui_proxy.go:438

// ✅ CORRECT pattern (already used in agui_proxy.go)
spec, found, err := unstructured.NestedMap(item.Object, "spec")
if err \!= nil || \!found {
    log.Printf("AGUI: Failed to get spec for %s/%s", projectName, sessionName)
    return map[string]interface{}{"phase": "Unknown"}, nil
}

Why It Matters: Type assertions can panic if structure changes. This violates the established type-safety pattern used throughout the codebase.


2. Backend: Panic Recovery Without Error Event to Client

Location: components/backend/websocket/agui.go:577-584
Violates: CLAUDE.md error handling pattern

// ❌ BAD: Panic caught silently, client sees no error
defer func() {
    if r := recover(); r \!= nil {
        log.Printf("AGUI: panic in sendInitialSyncEvents: %v", r)
    }
}()

Required Fix: Handle errors explicitly, emit error events to client

// ✅ CORRECT: Explicit error handling
if err := sendInitialSyncEvents(c, runState, projectName, sessionName); err \!= nil {
    log.Printf("AGUI: Failed to send initial sync events: %v", err)
    errorEvent := &types.RunErrorEvent{
        BaseEvent: types.NewBaseEvent(types.EventTypeRunError, runState.ThreadID, runState.RunID),
        Error:     fmt.Sprintf("Failed to initialize stream: %v", err),
    }
    writeSSEEvent(c.Writer, errorEvent)
    return
}

Why It Matters: Silent failures leave clients in undefined state. Per CLAUDE.md: "Never panic in production code" - use explicit error returns.


3. Frontend: Use of any Type

Location: components/frontend/src/hooks/use-agui-stream.ts
Violates: CLAUDE.md Frontend Rule #1 (Zero any Types)

Required Fix: Replace all any with proper types or unknown with type guards. The PR already defines comprehensive AG-UI types in types/agui.ts - use them.


🟡 Major Issues

4. Backend: Context.Background() in Long-Running Goroutine

Location: components/backend/websocket/agui_proxy.go:147
Issue: Potential goroutine leak without cleanup mechanism

// Line 147 - Detached context with 2-hour timeout
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Hour)
defer cancel()

Concerns:

  • No tracking of active background goroutines for debugging
  • No cleanup when session CR is deleted
  • updateRunStatus() errors silently ignored (lines 153, 168, 176)

Recommended Fix: Add goroutine tracking similar to operator's monitoredJobs pattern

// Track active proxy goroutines for cleanup
var (
    activeProxyGoroutines   = make(map[string]context.CancelFunc)
    activeProxyGoroutinesMu sync.RWMutex
)

// Register goroutine
activeProxyGoroutinesMu.Lock()
activeProxyGoroutines[runID] = cancel
activeProxyGoroutinesMu.Unlock()

// Cleanup on completion
defer func() {
    activeProxyGoroutinesMu.Lock()
    delete(activeProxyGoroutines, runID)
    activeProxyGoroutinesMu.Unlock()
}()

5. Backend: No Event Count Limits

Location: components/backend/websocket/agui.go:331
Issue: Loads ALL events without pagination

// Line 331: loads entire history - could be thousands of events
allEvents := loadEventsForRun(sessionName, "")

Recommendation: Add event limit or streaming window to prevent memory exhaustion in long-running sessions.


🔵 Minor Issues

6. Large Hook File

Location: components/frontend/src/hooks/use-agui-stream.ts (791 lines)

Suggestion: Consider splitting into smaller composable hooks for better maintainability. Not blocking, but would improve code organization.


7. Missing Test Coverage

Gaps:

  • No tests for event compaction logic (compaction.go)
  • No tests for legacy migration (legacy_translator.go)
  • No RBAC permission tests for new endpoints

Recommendation: Add in follow-up PR, especially for complex compaction state machine.


Positive Highlights

✅ Excellent Security Implementation (10/10)

All new endpoints properly implement authentication and RBAC:

  1. HandleAGUIEvents (agui.go:492-524):

    • ✅ Uses GetK8sClientsForRequest() for user-scoped client
    • ✅ Returns 401 if token invalid
    • ✅ Performs RBAC check before streaming
  2. HandleAGUIRunProxy (agui_proxy.go:27-58):

    • ✅ User token authentication
    • ✅ Checks update verb (appropriate for run creation)
    • ✅ Returns 403 if unauthorized
  3. All other endpoints (Interrupt, History, Runs):

    • ✅ Consistent auth/RBAC pattern throughout

Verdict: Security implementation perfectly follows CLAUDE.md standards.


✅ Clean Architecture (9/10)

  1. Proper separation of concerns:

    • Backend routes → Runner FastAPI → Claude SDK
    • Event streaming via SSE (correct AG-UI pattern)
    • Compact-on-read strategy eliminates complexity
  2. Backward compatibility:

    • Legacy migration in legacy_translator.go
    • Automatic conversion of old message format
    • No breaking changes to existing sessions
  3. Operator improvements:

    • Uses unstructured.NestedMap() correctly (lines 121, 438)
    • Proper error handling with IsNotFound() checks
    • StatusPatch pattern for batched updates

✅ Comprehensive Type Definitions

Frontend types (types/agui.ts):

  • Complete AG-UI protocol event types
  • Proper use of type over interface (follows standard)
  • Union types for type-safe event discrimination

✅ Build Metadata Enhancement

Makefile + Dockerfiles:

  • Embedded git metadata via ldflags
  • Proper cache invalidation with labels
  • Good observability for production debugging

Recommendations

Must Fix Before Merge (Critical)

  1. ✅ Fix unsafe type assertions in agui.go:743, 759 → use unstructured.NestedMap()
  2. ✅ Remove panic recovery in agui.go:577 → explicit error handling with event emission
  3. ✅ Replace any types in use-agui-stream.ts → use defined AG-UI types

Should Fix Before Merge (Major)

  1. ⚠️ Add goroutine tracking for background proxy streams
  2. ⚠️ Add event count limits to prevent memory exhaustion

Can Fix in Follow-up PR

  1. 📝 Add test coverage for compaction logic
  2. 📝 Add test coverage for legacy migration
  3. 📝 Create ADR documenting AG-UI adoption decision
  4. 📝 Split large hooks into smaller composable functions

Code Quality Breakdown

Category Score Notes
Security 10/10 Perfect auth/RBAC implementation
Error Handling 7/10 Good patterns, some silent failures
Type Safety 6/10 Unsafe assertions, any usage
Architecture 9/10 Clean separation, good design
Testing 4/10 Missing critical test coverage
Overall 8/10 Solid implementation with fixable issues

Conclusion

This PR represents a significant architectural improvement with proper security and thoughtful design. The AG-UI protocol integration is well-executed and follows established patterns from the codebase.

Fix the 3 critical issues (type safety, panic recovery, any types), then this is ready to merge. The goroutine lifecycle management should be addressed to prevent resource leaks in production.

Great work on maintaining security standards throughout! 🎉


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

@Gkrumbach07 Gkrumbach07 merged commit ab2514c into ambient-code:main Dec 18, 2025
18 of 19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant