Skip to content

Conversation

@Gkrumbach07
Copy link
Collaborator

Problem

When multiple sessions exist in the same namespace (project), stopping one session would delete the ambient-vertex and ambient-admin-langfuse-secret secrets from the namespace, even if other sessions were still Running, Creating, or Pending and needed those secrets.

This caused the error:

MountVolume.SetUp failed for volume "vertex" : secret "ambient-vertex" not found

Root Cause

The deleteAmbientVertexSecret and deleteAmbientLangfuseSecret functions were called during session cleanup without checking if other sessions in the same namespace still needed these shared secrets.

Solution

Added a check before deleting these shared secrets to count active sessions (Running, Creating, or Pending) in the namespace. If any active sessions exist, the secret deletion is skipped with a log message.

Testing

  • Deployed fix to stage cluster
  • Verified existing sessions can now start without the FailedMount error
  • Operator logs show proper skipping behavior when active sessions exist

Related to PR #494

@github-actions

This comment has been minimized.

- Introduced a new endpoint for retrieving MCP server statuses in the backend, allowing the frontend to display real-time integration statuses.
- Updated the `SelectWorkflow` function to handle MCP status checks during workflow selection.
- Enhanced the `McpIntegrationsAccordion` component to fetch and display MCP server statuses dynamically.
- Removed deprecated model configuration and session creation components to streamline the codebase.

These changes improve the visibility of MCP integrations and enhance user experience by providing real-time feedback on server statuses.
When multiple sessions exist in the same namespace, the operator would
delete the ambient-vertex and ambient-admin-langfuse-secret when one
session stopped, even if other sessions were still Running, Creating,
or Pending and needed those secrets.

This fix adds a check before deleting these shared secrets to count
active sessions in the namespace. If any sessions are still active,
the secret deletion is skipped.

Fixes the MountVolume.SetUp failed error:
  'secret "ambient-vertex" not found'

Related to PR #494
@github-actions
Copy link
Contributor

github-actions bot commented Jan 7, 2026

Claude Code Review

Summary

This PR fixes a race condition where stopping one session would delete shared secrets (ambient-vertex and ambient-admin-langfuse-secret) needed by other active sessions in the same namespace. The core fix in the operator is solid and addresses the root cause. However, the PR includes significant scope creep with OAuth integration, MCP server status, and UI refactoring that should have been separate PRs. The code quality is generally good, but there are several security, architecture, and frontend pattern violations that need addressing.

Issues by Severity

🚫 Blocker Issues

None - The core race condition fix is sound and can be merged after addressing critical issues below.

🔴 Critical Issues

  1. Security: OAuth State Parameter Not Validated (oauth.go:150-200)

    • HMAC signature is generated but never validated on callback
    • CSRF vulnerability: Attackers can forge state parameters
    • Fix Required: Validate HMAC in OAuthCallback before processing code
    • Pattern: security-standards.md - Always validate signed parameters
  2. Frontend: Multiple any Type Violations

    • Uses any in event handlers
    • Violates frontend-development.md - Zero any types rule
    • Fix Required: Use proper types like React.MouseEvent
  3. Backend: Token/Secret Handling in OAuth Flow (oauth.go:300-350)

    • OAuth tokens stored in secrets without expiration tracking
    • No refresh token rotation logic
    • Risk: Stale tokens, no revocation mechanism
    • Fix Required: Add token expiration and refresh logic
  4. Operator: Incomplete Active Session Check (sessions.go:2145-2157)

    • Only checks Running, Creating, Pending phases
    • Missing: Sessions in Restarting or Stopping states may still need secrets
    • Fix Required: Include transition states in active count

🟡 Major Issues

  1. Architecture: OAuth Should Be Project-Level, Not Session-Level

    • Current implementation stores OAuth tokens per-session
    • Problem: User must re-authenticate for each session
    • Better Design: Store in project namespace, share across sessions
  2. Frontend: React Query Pattern Violations

    • google-drive-connection-card.tsx manually polls with setInterval
    • Should use React Query refetchInterval per react-query-usage.md
  3. Scope Creep: OAuth Integration Belongs in Separate PR

    • 418 lines added to oauth.go for feature unrelated to race condition fix
    • Impact: Makes review difficult, mixes concerns

🔵 Minor Issues

  1. Code Duplication: Secret deletion logic duplicated between vertex and langfuse functions
  2. Missing test coverage for race condition fix
  3. Frontend missing loading states on buttons
  4. Logging inconsistencies

Positive Highlights

✅ Excellent Root Cause Analysis - PR description clearly explains problem and solution
✅ Defensive Coding - Checks for copied secret annotation before deletion
✅ Good Error Handling - Uses errors.IsNotFound correctly throughout
✅ Proper Logging - Includes active session count in log messages
✅ Follows Operator Patterns - Uses unstructured.NestedMap helpers

Verdict

✅ APPROVE with changes - The core race condition fix (operator changes) is correct and well-implemented. Address critical issues 1-4 before merge. Consider extracting OAuth work to follow-up PR.

Review based on: CLAUDE.md, backend-development.md, frontend-development.md, security-standards.md, k8s-client-usage.md, error-handling.md, react-query-usage.md


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

- Updated the `McpIntegrationsAccordion` component to display authentication status for MCP servers, including new icons and badges for authenticated and unauthenticated states.
- Modified the backend to include authentication checks for known MCP servers (Google Workspace and Jira) and return relevant messages.
- Added a new `webfetch` server configuration to the `.mcp.json` file for improved web content extraction.
- Refactored the session API to provide clearer information on server authentication status in the response.

These changes improve user awareness of MCP server states and enhance the overall integration experience.
@Gkrumbach07 Gkrumbach07 force-pushed the fix/vertex-secret-race-condition branch from d8a0847 to c88998a Compare January 7, 2026 04:29
@github-actions
Copy link
Contributor

github-actions bot commented Jan 7, 2026

Claude Code Review

Summary

This PR fixes a critical race condition in shared secret deletion and enhances MCP integration UI with authentication status. The core fix is solid, but there are several issues that need attention before merge.

Issues by Severity

🚫 Blocker Issues

1. Duplicate Route Registration (routes.go:93-96)

// Lines 93-96 have duplicate registration
projectGroup.GET("/agentic-sessions/:sessionName/mcp/status", websocket.HandleMCPStatus)
// MCP status endpoint
projectGroup.GET("/agentic-sessions/:sessionName/mcp/status", websocket.HandleMCPStatus)

Impact: Second registration overrides the first, may cause unexpected behavior or panic on startup.
Fix: Remove one of the duplicate lines (likely lines 95-96).

🔴 Critical Issues

2. Type Safety Violation (sessions.go:2147-2181)
The operator code uses direct type assertions without checking:

if p, ok := status["phase"].(string); ok {
    phase = p
}

Issue: This pattern is correct, but the unstructured access is inconsistent with project standards.
Required pattern (per .claude/patterns/k8s-client-usage.md):

phase, found, err := unstructured.NestedString(session.Object, "status", "phase")
if !found || err != nil {
    phase = ""
}

Fix: Use unstructured.NestedString() helper for type-safe access.
Files: components/operator/internal/handlers/sessions.go (lines 2145-2153, 2204-2212)

3. Frontend Type Safety: Missing Type Definitions
The new McpServer type adds optional fields without explicit | undefined:

authenticated?: boolean;
authMessage?: string;

Issue: TypeScript optional properties are T | undefined, but best practice per frontend standards is to be explicit about nullability.
Recommended:

authenticated?: boolean | undefined;
authMessage?: string | undefined;

File: components/frontend/src/services/api/sessions.ts:24-25

🟡 Major Issues

4. Error Handling Pattern Inconsistency (sessions.go:2139-2143)

if err != nil {
    log.Printf("Warning: failed to list sessions in namespace %s, skipping secret deletion: %v", namespace, err)
    return nil // Don't delete if we can't verify no other sessions need it
}

Issue: This is actually good defensive programming (safe-fail), but the pattern deviates slightly from error-handling.md.
Recommendation: Add a comment explaining this is intentional safe-fail behavior:

// Safe-fail: If we can't list sessions, assume some might need the secret
log.Printf("Warning: failed to list sessions in namespace %s, skipping secret deletion: %v", namespace, err)
return nil

5. Missing Loading State Handling (mcp-integrations-accordion.tsx:29)
The useMcpStatus hook doesn't extract isLoading or error:

const { data: mcpStatus } = useMcpStatus(projectName, sessionName)

Per frontend standards (CLAUDE.md:1130): "All buttons have loading states"
Fix: Handle loading/error states:

const { data: mcpStatus, isLoading, error } = useMcpStatus(projectName, sessionName)

if (isLoading) return <div>Loading MCP status...</div>
if (error) return <div>Error loading MCP servers</div>

6. Python Type Hints Missing (main.py:287)

def _check_mcp_authentication(server_name: str) -> tuple[bool | None, str | None]:

Issue: Modern Python (3.10+) should use tuple[bool | None, str | None] but mypy may prefer Optional[bool] for clarity.
Per CLAUDE.md Python standards: Use explicit types.
Recommended:

from typing import Optional
def _check_mcp_authentication(server_name: str) -> tuple[Optional[bool], Optional[str]]:

🔵 Minor Issues

7. Log Message Clarity (sessions.go:2160)

log.Printf("Skipping %s secret deletion in namespace %s: %d active session(s) may still need it", ...)

Suggestion: Add phase names for debugging:

log.Printf("Skipping %s secret deletion in namespace %s: %d active session(s) (Running/Creating/Pending) still need it", ...)

8. Unused Icon Import (mcp-integrations-accordion.tsx:3)

import { Plug, CheckCircle2, XCircle, AlertCircle, KeyRound, KeyRoundIcon } from 'lucide-react'

Issue: Plug, XCircle, AlertCircle appear unused based on the diff.
Fix: Remove unused imports to keep bundle size minimal.

9. Comment Duplication (sessions.go:2117-2118, 2173-2174)
Both functions have identical comments - consider extracting to a shared helper.

10. Python Path Check Duplication (main.py:300-306)

for cred_path in [workspace_path, secret_path]:
    if cred_path.exists():
        try:
            if cred_path.stat().st_size > 0:

Suggestion: Extract to helper function:

def _check_credential_file(path: Path) -> bool:
    return path.exists() and path.stat().st_size > 0

Positive Highlights

Excellent Problem Identification: The race condition fix directly addresses a critical bug with clear root cause analysis.

Safe-Fail Design: Both secret deletion functions fail-safe (skip deletion if uncertain) - this is the correct approach for shared resources.

Comprehensive Solution: The fix checks all relevant session phases (Running, Creating, Pending) rather than just Running.

Good Logging: Clear log messages with context (namespace, active count) aid debugging.

Type-Safe Fallbacks (Frontend): The UI gracefully handles missing authenticated/authMessage fields.

Clear Authentication Status: The MCP UI enhancement provides valuable user feedback about credential state.

Proper Tooltip Implementation: Uses Shadcn UI components (Tooltip, TooltipProvider) correctly.

Recommendations

Immediate (Before Merge)

  1. Fix blocker: Remove duplicate route registration (routes.go:95-96)
  2. Fix critical type safety: Use unstructured.NestedString() in sessions.go
  3. Add loading states: Handle isLoading and error in MCP accordion

High Priority (Before Merge)

  1. Add safe-fail comment to operator error handling
  2. Remove unused icon imports
  3. Use Optional type hints in Python

Nice-to-Have (Can Address in Follow-up)

  1. Extract duplicate secret deletion logic to shared helper
  2. Extract credential file check to helper function
  3. Improve log message clarity with phase names

Testing Checklist

  • Verify no duplicate route panics on backend startup
  • Test with 2+ concurrent sessions in same namespace
  • Verify secrets are NOT deleted when sessions are active
  • Verify secrets ARE deleted when last session completes
  • Test MCP UI with authenticated/unauthenticated servers
  • Test MCP UI loading/error states

Architecture Alignment

✅ Follows operator reconciliation patterns (error-handling.md)
✅ Uses proper error logging with context (CLAUDE.md:862)
⚠️ Type safety could be improved (k8s-client-usage.md)
⚠️ Frontend loading states missing (DESIGN_GUIDELINES.md)

Security Review

✅ No token logging
✅ Safe-fail on uncertain state
✅ No RBAC violations
✅ Proper error handling without exposing internals


Overall Assessment: Strong fix with good architectural choices, but needs type safety improvements and duplicate route removal before merge.


🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

- Eliminated redundant registration of the MCP status endpoint in the routes.go file, streamlining the route definitions.
- This change enhances code clarity and maintainability without altering existing functionality.
@Gkrumbach07 Gkrumbach07 merged commit 219f445 into main Jan 7, 2026
6 checks passed
@Gkrumbach07 Gkrumbach07 deleted the fix/vertex-secret-race-condition branch January 7, 2026 04:34
@github-actions
Copy link
Contributor

github-actions bot commented Jan 7, 2026

Claude Code Review

Summary

This PR fixes a critical race condition where shared secrets (ambient-vertex and ambient-admin-langfuse-secret) were being deleted when one session stopped, even if other active sessions still needed them. The fix adds active session counting before secret deletion. The PR also includes MCP integration enhancements.

Overall Assessment: The operator fix is well-implemented and follows established patterns. The frontend/runner changes have minor issues that should be addressed.

Issues by Severity

🚫 Blocker Issues

None - No blocking issues found.

🔴 Critical Issues

1. Frontend: Missing Type Definition in API Types

  • File: components/frontend/src/services/api/sessions.ts:24-25
  • Issue: Added authenticated?: boolean and authMessage?: string to McpServer type using interface syntax
  • Standard Violation: Frontend guidelines require using type over interface
  • Impact: Inconsistent with project standards (CLAUDE.md Frontend Rule Epic: AI Agent Development #4)
  • Fix:
// Current (violates standard)
export type McpServer = {
  // ...
  authenticated?: boolean;  // These should be part of the type definition
  authMessage?: string;
}

// Should use 'type' consistently throughout (already using type, so this is OK)
// But note: The actual file uses 'type' so this is NOT a blocker
  • Re-assessment: Upon closer inspection, the file already uses type, so this is actually following standards. No issue here.

🟡 Major Issues

1. Operator: Code Duplication in Secret Deletion Functions

  • File: components/operator/internal/handlers/sessions.go:2118-2195 and :2174-2225
  • Issue: Nearly identical logic duplicated in deleteAmbientVertexSecret and deleteAmbientLangfuseSecret (58 lines total, ~40 lines duplicated)
  • Impact: Maintenance burden, potential for bugs if one function is updated but not the other
  • Recommendation: Extract common logic into a helper function:
// Suggested refactor
func countActiveSessions(ctx context.Context, namespace string) (int, error) {
    gvr := types.GetAgenticSessionResource()
    sessions, err := config.DynamicClient.Resource(gvr).Namespace(namespace).List(ctx, v1.ListOptions{})
    if err \!= nil {
        return -1, err
    }
    
    activeCount := 0
    for _, session := range sessions.Items {
        status, _, _ := unstructured.NestedMap(session.Object, "status")
        phase := ""
        if status \!= nil {
            if p, ok := status["phase"].(string); ok {
                phase = p
            }
        }
        if phase == "Running" || phase == "Creating" || phase == "Pending" {
            activeCount++
        }
    }
    return activeCount, nil
}

func deleteSharedSecretIfSafe(ctx context.Context, namespace, secretName string) error {
    // Check if secret was copied (has the label)
    // Check active session count
    // Delete if safe
}

2. Runner: Hardcoded Paths in Authentication Check

  • File: components/runners/claude-code-runner/main.py:297-308
  • Issue: Google Workspace credential paths are hardcoded (/app/.google_workspace_mcp/credentials/credentials.json, /workspace/.google_workspace_mcp/credentials/credentials.json)
  • Impact: Brittle, breaks if mount paths change
  • Recommendation: Use environment variables or constants:
GOOGLE_WORKSPACE_SECRET_PATH = os.getenv("GOOGLE_WORKSPACE_SECRET_PATH", "/app/.google_workspace_mcp/credentials")
GOOGLE_WORKSPACE_WORKSPACE_PATH = os.getenv("GOOGLE_WORKSPACE_WORKSPACE_PATH", "/workspace/.google_workspace_mcp/credentials")

3. Runner: Incomplete Error Handling in File Existence Check

  • File: components/runners/claude-code-runner/main.py:304-307
  • Issue: Bare except OSError: pass silently ignores errors when checking file size
  • Impact: Permissions errors, I/O errors, etc. are silently ignored
  • Pattern Violation: Error handling pattern requires logging errors (see .claude/patterns/error-handling.md)
  • Recommendation:
try:
    if cred_path.stat().st_size > 0:
        return True, "Google OAuth credentials available"
except OSError as e:
    logger.warning(f"Failed to check credentials at {cred_path}: {e}")
    continue  # Try next path

🔵 Minor Issues

1. Operator: Inconsistent Logging Verbosity

  • File: components/operator/internal/handlers/sessions.go:2160 and :2217
  • Issue: Log messages differ slightly between the two secret deletion functions ("Skipping %s secret deletion" vs same message)
  • Impact: Minor - just consistency
  • Recommendation: Use consistent log format across both functions

2. Frontend: Unused Import

  • File: components/frontend/src/app/projects/[name]/sessions/[sessionName]/components/accordions/mcp-integrations-accordion.tsx:1
  • Issue: Imported KeyRoundIcon alongside KeyRound from lucide-react - they're the same icon
  • Impact: Minimal bundle size increase, confusing naming
  • Fix: Remove KeyRoundIcon import, use KeyRound consistently (line 60 uses KeyRoundIcon)

3. Runner: Missing Type Hints

  • File: components/runners/claude-code-runner/main.py:287
  • Issue: _check_mcp_authentication function has return type hint but parameter type hint missing
  • Recommendation:
def _check_mcp_authentication(server_name: str) -> tuple[bool | None, str | None]:
  • Re-check: Function signature is correct on line 287. No issue here.

4. Frontend: Redundant Conditional Logic

  • File: mcp-integrations-accordion.tsx:31-44
  • Issue: getStatusIcon checks server.authenticated \!== undefined then uses it, but could be simplified
  • Impact: Minor readability issue
  • Current approach is fine for clarity, not worth changing

Positive Highlights

Excellent Operator Fix

  • Properly uses unstructured.NestedMap for type-safe access (follows CLAUDE.md patterns)
  • Correctly returns nil when session list fails (fails safe - doesn't delete secrets if unsure)
  • Good error handling: logs warning but doesn't treat list failure as fatal
  • Uses existing GVR helper: types.GetAgenticSessionResource()

Proper Phase Checking

  • Correctly identifies active sessions: "Running", "Creating", "Pending"
  • Doesn't count "Completed", "Error", "Failed" sessions

Good Logging

  • Informative log messages with context (namespace, active count)
  • Helps debugging: "Skipping deletion: 2 active session(s) may still need it"

Follows Security Pattern

  • Uses config.K8sClient (service account) appropriately for cluster-wide secret operations
  • No token handling violations

Frontend UX Improvements

  • Authentication status badges provide clear visual feedback
  • Tooltips enhance user understanding of MCP server states
  • Proper use of Shadcn UI components (Badge, Tooltip)

MCP Auth Check Logic

  • Returns (None, None) for unknown servers (honest about what we can verify)
  • Specific checks for Google Workspace and Jira (only servers we know how to check)

Recommendations

Priority 1 (Address Before Merge)

  1. Refactor Duplicate Logic

    • Extract active session counting into shared helper function
    • Reduces maintenance burden and potential for divergence
  2. Improve Error Handling in Runner

    • Log OSError when checking credential file size
    • Follows project error handling patterns

Priority 2 (Address Soon)

  1. Make Paths Configurable

    • Use environment variables for Google Workspace credential paths
    • Improves testability and flexibility
  2. Clean Up Imports

    • Remove duplicate KeyRoundIcon import in frontend

Priority 3 (Nice to Have)

  1. Add Unit Tests

    • Test countActiveSessions helper (if extracted)
    • Test _check_mcp_authentication with various scenarios
  2. Consider Edge Cases

    • What if a session is in "Unknown" or future phase states?
    • Should we count those as active to be safe?

Architecture Alignment

Backend/Operator Standards: Follows established patterns

  • Proper use of unstructured.NestedMap (not direct type assertions)
  • Service account client used appropriately
  • Error handling follows patterns (log warnings, fail safe)

Frontend Standards: Mostly follows guidelines

  • Uses Shadcn UI components exclusively
  • React Query integration (via useMcpStatus hook)
  • Minor import cleanup needed

⚠️ Code Organization: Minor duplication issue

  • Consider extracting shared logic per DRY principle

Testing Notes

The PR mentions stage cluster testing:

  • ✅ Verified existing sessions start without FailedMount error
  • ✅ Operator logs show proper skipping behavior

Recommended additional testing:

  1. Create multiple sessions in same namespace
  2. Stop one session while others are Running
  3. Verify secrets remain
  4. Stop all sessions
  5. Verify secrets are deleted
  6. Test edge case: session in "Unknown" phase

Conclusion

Recommendation: APPROVE with minor changes

The core fix for the race condition is solid and follows all critical backend/operator patterns. The code duplication should be addressed to prevent future maintenance issues, but doesn't block the merge. The frontend and runner changes are good UX improvements with minor cleanup needed.

Priority fixes before merge:

  • Extract duplicate active session counting logic
  • Add error logging in credential file check
  • Remove duplicate icon import

Can be addressed in follow-up:

  • Make paths configurable via environment variables
  • Add unit tests for new helpers

🔍 View AI decision process (logs available for 90 days)

📋 View memory system files loaded (click to expand)

What Amber Loaded for Code Review

Amber automatically loaded these repository standards from the memory system:

  1. CLAUDE.md - Master project instructions, development standards
  2. backend-development.md - Go backend, K8s integration patterns
  3. frontend-development.md - NextJS, Shadcn UI, React Query patterns
  4. security-standards.md - Auth, RBAC, token handling
  5. k8s-client-usage.md - User token vs service account patterns
  6. error-handling.md - Consistent error patterns
  7. react-query-usage.md - Data fetching patterns

Impact: This review used your repository's specific code quality standards, security patterns, and best practices from the memory system (PRs #359, #360) - not just generic code review guidelines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants