From 193b0340c6c257f95d40a96ea08515d24078b6dd Mon Sep 17 00:00:00 2001
From: Gage Krumbach
Date: Sat, 20 Dec 2025 07:36:43 -0600
Subject: [PATCH 1/6] feat: Enhance session handling and observability

- Refactored session management to improve clarity and efficiency, including the removal of self-referential parent-session-id annotations.
- Updated session workspace path handling to be relative to the content service's StateBaseDir, simplifying path management.
- Introduced graceful shutdown for the content service, improving reliability during server termination (a sketch of the pattern follows below).
- Enhanced the observability stack with new Grafana dashboard configurations and metrics for session lifecycle tracking.
- Cleaned up unused code and improved logging for easier debugging and maintenance.

chore: Update .gitignore and remove obsolete deployment documentation

- Added build-log and log-file patterns to .gitignore to prevent accidental commits.
- Deleted outdated deployment documentation files (DEPLOYMENT_CHANGES.md, DIFF_IMPROVEMENTS.md, S3_MIGRATION_GAPS.md, and OPENSHIFT_SETUP.md) that are no longer relevant to the current architecture.
- Cleaned up observability-related files, including Grafana and Prometheus configurations, to streamline the observability stack.

feat: Enhance operator metrics and session handling

- Introduced Prometheus metrics for monitoring the session lifecycle, including startup duration, phase transitions, and error tracking.
- Updated session handling to record metrics during reconciliation, including session creation and completion.
- Refactored session management logic to ensure consistent behavior across API and kubectl session creations.
- Increased QPS and Burst settings for the Kubernetes client to improve performance under load.
- Added a new Service and ServiceMonitor for exposing operator metrics in the ambient-code namespace.

feat: Refactor AgenticSession handling to use Pods instead of Jobs

- Updated the operator to create and manage Pods directly for AgenticSessions, improving startup speed and reducing complexity.
- Changed environment variable references and logging to reflect the transition from Jobs to Pods.
- Adjusted cleanup logic to handle Pods appropriately, including service creation and monitoring.
- Modified deployment configurations to ensure compatibility with the new Pod-based architecture.

feat: Implement S3 storage configuration for session artifacts

- Added support for S3-compatible storage in the settings section, allowing users to configure the S3 endpoint, bucket, region, access key, and secret key.
- Updated the operator to persist session state and artifacts in S3, replacing the previous temporary content pod mechanism.
- Removed deprecated references to temporary content pods and PVCs, transitioning to an EmptyDir storage model with S3 integration.
- Enhanced the operator's handling of S3 configuration, ensuring proper validation and logging for S3 settings.
- Updated the Makefile to include new build targets for the state-sync image and MinIO setup.

feat: Enhance operator deployment with controller-runtime features

- Added command-line arguments for metrics and health probe endpoints, enabling better observability.
- Implemented concurrent reconciliation with a configurable maximum, improving performance.
- Updated the Dockerfile to use ENTRYPOINT for better argument handling.
- Enhanced health checks with HTTP probes for liveness and readiness.
- Updated the README to reflect new configuration options and features.
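The content-service shutdown mentioned above follows the standard net/http graceful-shutdown pattern instead of gin's blocking Run(). A minimal, self-contained sketch of the pattern is below; the port, handler, and log messages are placeholders, and the actual wiring is in components/backend/server/server.go in this patch:

    package main

    import (
        "context"
        "log"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

        // Catch SIGINT/SIGTERM (the kubelet sends SIGTERM during pod termination).
        quit := make(chan os.Signal, 1)
        signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)

        go func() {
            // Serve until Shutdown is called; ErrServerClosed is expected at that point.
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatalf("listen error: %v", err)
            }
        }()

        sig := <-quit
        log.Printf("received %v, shutting down gracefully", sig)

        // Give in-flight requests up to 10 seconds to finish, matching the patch.
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("forced shutdown: %v", err)
        }
        log.Println("shutdown complete")
    }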
feat: Enhance observability stack deployment and cleanup in Makefile - Added new targets for deploying and cleaning up the observability stack, including OpenTelemetry and Grafana. - Introduced commands for accessing Grafana and Prometheus dashboards. - Updated .gitignore to include secrets template for MinIO credentials. - Removed deprecated image-prepuller DaemonSet and associated metrics service from manifests. - Updated Makefile to reflect changes in observability management and improve user experience. refactor: Clean up observability stack and enhance session handling - Removed obsolete observability stack deployment commands from Makefile. - Updated session handling in the operator to improve clarity and efficiency. - Introduced a new state sync image in deployment scripts and updated related configurations. - Refactored metrics handling for session lifecycle, ensuring consistent error tracking and performance monitoring. - Cleaned up unused code and improved readability across multiple files. feat: Refactor S3 storage configuration in settings and operator - Replaced S3_ENABLED with STORAGE_MODE to allow selection between shared and custom storage options. - Updated settings section to include radio buttons for storage mode selection, enhancing user experience. - Modified operator session handling to read and apply storage mode, ensuring proper configuration for S3 settings. - Improved logging for storage mode usage, clarifying the configuration process for users. --- .gitignore | 8 + Makefile | 61 +- components/backend/handlers/sessions.go | 645 ++----- components/backend/server/server.go | 42 +- .../[name]/sessions/[sessionName]/page.tsx | 11 - .../sessions/[sessionName]/session-header.tsx | 15 - .../src/components/session-details-modal.tsx | 116 +- .../workspace-sections/settings-section.tsx | 161 +- .../frontend/src/types/project-settings.ts | 8 + components/manifests/base/kustomization.yaml | 3 + .../minio-credentials-secret.yaml.example | 31 + .../manifests/base/minio-deployment.yaml | 102 + .../manifests/base/operator-deployment.yaml | 46 +- .../base/rbac/operator-clusterrole.yaml | 4 +- components/manifests/deploy.sh | 4 + components/manifests/observability/README.md | 191 ++ .../ambient-operator-dashboard.json | 366 ++++ .../manifests/observability/grafana.yaml | 490 +++++ .../observability/kustomization.yaml | 15 + .../observability/otel-collector.yaml | 108 ++ .../grafana-datasource-patch.yaml | 45 + .../overlays/with-grafana/kustomization.yaml | 15 + .../observability/servicemonitor.yaml | 22 + .../overlays/production/kustomization.yaml | 3 + components/operator/Dockerfile | 6 +- components/operator/README.md | 155 +- components/operator/go.mod | 34 +- components/operator/go.sum | 83 +- components/operator/internal/config/config.go | 28 + .../controller/agenticsession_controller.go | 301 +++ .../internal/controller/otel_metrics.go | 467 +++++ .../internal/controller/reconcile_phases.go | 382 ++++ .../operator/internal/handlers/helpers.go | 7 +- .../operator/internal/handlers/namespaces.go | 7 +- .../operator/internal/handlers/reconciler.go | 450 +++++ .../operator/internal/handlers/sessions.go | 1716 +++++++---------- .../internal/services/infrastructure.go | 34 +- components/operator/main.go | 191 +- .../runners/claude-code-runner/adapter.py | 284 +-- components/runners/claude-code-runner/main.py | 33 +- components/runners/state-sync/Dockerfile | 21 + components/runners/state-sync/hydrate.sh | 232 +++ components/runners/state-sync/sync.sh | 156 ++ 
docs/minio-quickstart.md | 297 +++ docs/operator-metrics-visualization.md | 134 ++ docs/s3-storage-configuration.md | 393 ++++ scripts/setup-minio.sh | 85 + 47 files changed, 6042 insertions(+), 1966 deletions(-) create mode 100644 components/manifests/base/minio-credentials-secret.yaml.example create mode 100644 components/manifests/base/minio-deployment.yaml create mode 100644 components/manifests/observability/README.md create mode 100644 components/manifests/observability/dashboards/ambient-operator-dashboard.json create mode 100644 components/manifests/observability/grafana.yaml create mode 100644 components/manifests/observability/kustomization.yaml create mode 100644 components/manifests/observability/otel-collector.yaml create mode 100644 components/manifests/observability/overlays/with-grafana/grafana-datasource-patch.yaml create mode 100644 components/manifests/observability/overlays/with-grafana/kustomization.yaml create mode 100644 components/manifests/observability/servicemonitor.yaml create mode 100644 components/operator/internal/controller/agenticsession_controller.go create mode 100644 components/operator/internal/controller/otel_metrics.go create mode 100644 components/operator/internal/controller/reconcile_phases.go create mode 100644 components/operator/internal/handlers/reconciler.go create mode 100644 components/runners/state-sync/Dockerfile create mode 100644 components/runners/state-sync/hydrate.sh create mode 100644 components/runners/state-sync/sync.sh create mode 100644 docs/minio-quickstart.md create mode 100644 docs/operator-metrics-visualization.md create mode 100644 docs/s3-storage-configuration.md create mode 100755 scripts/setup-minio.sh diff --git a/.gitignore b/.gitignore index 4925129cb..b84450271 100644 --- a/.gitignore +++ b/.gitignore @@ -140,3 +140,11 @@ reports/ # Security scan artifacts (transient) .security-scan/ .security-scan.zip + +# Secrets (should use .example templates) +**/minio-credentials-secret.yaml + +# Build artifacts and logs +build.log +*.log +!components/**/*.log diff --git a/Makefile b/Makefile index 13fa26ca6..987164c05 100644 --- a/Makefile +++ b/Makefile @@ -1,10 +1,11 @@ -.PHONY: help setup build-all build-frontend build-backend build-operator build-runner deploy clean +.PHONY: help setup build-all build-frontend build-backend build-operator build-runner build-state-sync deploy clean .PHONY: local-up local-down local-clean local-status local-rebuild local-reload-backend local-reload-frontend local-reload-operator local-sync-version .PHONY: local-dev-token .PHONY: local-logs local-logs-backend local-logs-frontend local-logs-operator local-shell local-shell-frontend .PHONY: local-test local-test-dev local-test-quick test-all local-url local-troubleshoot local-port-forward local-stop-port-forward .PHONY: push-all registry-login setup-hooks remove-hooks check-minikube check-kubectl .PHONY: e2e-test e2e-setup e2e-clean deploy-langfuse-openshift +.PHONY: setup-minio minio-console minio-logs minio-status .PHONY: validate-makefile lint-makefile check-shell makefile-health .PHONY: _create-operator-config _auto-port-forward _show-access-info _build-and-load @@ -36,6 +37,7 @@ FRONTEND_IMAGE ?= vteam_frontend:latest BACKEND_IMAGE ?= vteam_backend:latest OPERATOR_IMAGE ?= vteam_operator:latest RUNNER_IMAGE ?= vteam_claude_runner:latest +STATE_SYNC_IMAGE ?= vteam_state_sync:latest # Build metadata (captured at build time) GIT_COMMIT := $(shell git rev-parse HEAD 2>/dev/null || echo "unknown") @@ -91,7 +93,7 @@ help: ## Display this help 
message ##@ Building -build-all: build-frontend build-backend build-operator build-runner ## Build all container images +build-all: build-frontend build-backend build-operator build-runner build-state-sync ## Build all container images build-frontend: ## Build frontend image @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Building frontend with $(CONTAINER_ENGINE)..." @@ -145,6 +147,13 @@ build-runner: ## Build Claude Code runner image -t $(RUNNER_IMAGE) -f claude-code-runner/Dockerfile . @echo "$(COLOR_GREEN)✓$(COLOR_RESET) Runner built: $(RUNNER_IMAGE)" +build-state-sync: ## Build state-sync image for S3 persistence + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Building state-sync with $(CONTAINER_ENGINE)..." + @echo " Git: $(GIT_BRANCH)@$(GIT_COMMIT_SHORT)$(GIT_DIRTY)" + @cd components/runners/state-sync && $(CONTAINER_ENGINE) build $(PLATFORM_FLAG) $(BUILD_FLAGS) \ + -t vteam_state_sync:latest . + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) State-sync built: vteam_state_sync:latest" + ##@ Git Hooks setup-hooks: ## Install git hooks for branch protection @@ -164,13 +173,59 @@ registry-login: ## Login to container registry push-all: registry-login ## Push all images to registry @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Pushing images to $(REGISTRY)..." - @for image in $(FRONTEND_IMAGE) $(BACKEND_IMAGE) $(OPERATOR_IMAGE) $(RUNNER_IMAGE); do \ + @for image in $(FRONTEND_IMAGE) $(BACKEND_IMAGE) $(OPERATOR_IMAGE) $(RUNNER_IMAGE) $(STATE_SYNC_IMAGE); do \ echo " Tagging and pushing $$image..."; \ $(CONTAINER_ENGINE) tag $$image $(REGISTRY)/$$image && \ $(CONTAINER_ENGINE) push $(REGISTRY)/$$image; \ done @echo "$(COLOR_GREEN)✓$(COLOR_RESET) All images pushed" +##@ MinIO S3 Storage + +setup-minio: ## Set up MinIO and create initial bucket + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Setting up MinIO for S3 state storage..." + @./scripts/setup-minio.sh + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) MinIO setup complete" + +minio-console: ## Open MinIO console (port-forward to localhost:9001) + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Opening MinIO console at http://localhost:9001" + @echo " Login: admin / changeme123 (or your configured credentials)" + @kubectl port-forward svc/minio 9001:9001 -n $(NAMESPACE) + +minio-logs: ## View MinIO logs + @kubectl logs -f deployment/minio -n $(NAMESPACE) + +minio-status: ## Check MinIO status + @echo "$(COLOR_BOLD)MinIO Status$(COLOR_RESET)" + @kubectl get deployment,pod,svc,pvc -l app=minio -n $(NAMESPACE) + +##@ Observability + +deploy-observability: ## Deploy observability (OTel + OpenShift Prometheus) + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Deploying observability stack..." + @kubectl apply -k components/manifests/observability/ + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) Observability deployed (OTel + ServiceMonitor)" + @echo " View metrics: OpenShift Console → Observe → Metrics" + @echo " Optional Grafana: make add-grafana" + +add-grafana: ## Add Grafana on top of observability stack + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Adding Grafana..." + @kubectl apply -k components/manifests/observability/overlays/with-grafana/ + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) Grafana deployed" + @echo " Create route: oc create route edge grafana --service=grafana -n $(NAMESPACE)" + +clean-observability: ## Remove observability components + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Removing observability..." 
+ @kubectl delete -k components/manifests/observability/overlays/with-grafana/ 2>/dev/null || true + @kubectl delete -k components/manifests/observability/ 2>/dev/null || true + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) Observability removed" + +grafana-dashboard: ## Open Grafana (create route first) + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Opening Grafana..." + @oc create route edge grafana --service=grafana -n $(NAMESPACE) 2>/dev/null || echo "Route already exists" + @echo " URL: https://$$(oc get route grafana -n $(NAMESPACE) -o jsonpath='{.spec.host}')" + @echo " Login: admin/admin" + ##@ Local Development (Minikube) local-up: check-minikube check-kubectl ## Start local development environment (minikube) diff --git a/components/backend/handlers/sessions.go b/components/backend/handlers/sessions.go index b413c9669..591213de8 100644 --- a/components/backend/handlers/sessions.go +++ b/components/backend/handlers/sessions.go @@ -25,13 +25,10 @@ import ( "github.com/gin-gonic/gin" authnv1 "k8s.io/api/authentication/v1" authzv1 "k8s.io/api/authorization/v1" - corev1 "k8s.io/api/core/v1" - rbacv1 "k8s.io/api/rbac/v1" "k8s.io/apimachinery/pkg/api/errors" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" "k8s.io/apimachinery/pkg/runtime/schema" - ktypes "k8s.io/apimachinery/pkg/types" "k8s.io/client-go/dynamic" "k8s.io/client-go/kubernetes" ) @@ -45,8 +42,6 @@ var ( // LEGACY: SendMessageToSession removed - AG-UI server uses HTTP/SSE instead of WebSocket ) -const runnerTokenRefreshedAtAnnotation = "ambient-code.io/token-refreshed-at" - // ootbWorkflowsCache provides in-memory caching for OOTB workflows to avoid GitHub API rate limits. // The cache stores workflows by repo URL key and expires after ootbCacheTTL. type ootbWorkflowsCache struct { @@ -719,13 +714,8 @@ func CreateSession(c *gin.Context) { } }() - // Provision runner token using backend SA (requires elevated permissions for SA/Role/Secret creation) - if DynamicClient == nil || K8sClient == nil { - log.Printf("Warning: backend SA clients not available, skipping runner token provisioning for session %s/%s", project, name) - } else if err := provisionRunnerTokenForSession(c, K8sClient, DynamicClient, project, name); err != nil { - // Nonfatal: log and continue. Operator may retry later if implemented. - log.Printf("Warning: failed to provision runner token for session %s/%s: %v", project, name, err) - } + // Runner token provisioning is handled by the operator when creating the pod. + // This ensures consistent behavior whether sessions are created via API or kubectl. c.JSON(http.StatusCreated, gin.H{ "message": "Agentic session created successfully", @@ -734,171 +724,6 @@ func CreateSession(c *gin.Context) { }) } -// provisionRunnerTokenForSession creates a per-session ServiceAccount, grants minimal RBAC, -// mints a short-lived token, stores it in a Secret, and annotates the AgenticSession with the Secret name. 
-func provisionRunnerTokenForSession(c *gin.Context, reqK8s kubernetes.Interface, reqDyn dynamic.Interface, project string, sessionName string) error { - // Load owning AgenticSession to parent all resources - gvr := GetAgenticSessionV1Alpha1Resource() - obj, err := reqDyn.Resource(gvr).Namespace(project).Get(c.Request.Context(), sessionName, v1.GetOptions{}) - if err != nil { - return fmt.Errorf("get AgenticSession: %w", err) - } - ownerRef := v1.OwnerReference{ - APIVersion: obj.GetAPIVersion(), - Kind: obj.GetKind(), - Name: obj.GetName(), - UID: obj.GetUID(), - Controller: types.BoolPtr(true), - } - - // Create ServiceAccount - saName := fmt.Sprintf("ambient-session-%s", sessionName) - sa := &corev1.ServiceAccount{ - ObjectMeta: v1.ObjectMeta{ - Name: saName, - Namespace: project, - Labels: map[string]string{"app": "ambient-runner"}, - OwnerReferences: []v1.OwnerReference{ownerRef}, - }, - } - if _, err := reqK8s.CoreV1().ServiceAccounts(project).Create(c.Request.Context(), sa, v1.CreateOptions{}); err != nil { - if !errors.IsAlreadyExists(err) { - return fmt.Errorf("create SA: %w", err) - } - } - - // Create Role with least-privilege for updating AgenticSession status and annotations - roleName := fmt.Sprintf("ambient-session-%s-role", sessionName) - role := &rbacv1.Role{ - ObjectMeta: v1.ObjectMeta{ - Name: roleName, - Namespace: project, - OwnerReferences: []v1.OwnerReference{ownerRef}, - }, - Rules: []rbacv1.PolicyRule{ - { - APIGroups: []string{"vteam.ambient-code"}, - Resources: []string{"agenticsessions"}, - Verbs: []string{"get", "list", "watch", "update", "patch"}, // Added update, patch for annotations - }, - { - APIGroups: []string{"authorization.k8s.io"}, - Resources: []string{"selfsubjectaccessreviews"}, - Verbs: []string{"create"}, - }, - }, - } - // Try to create or update the Role to ensure it has latest permissions - if _, err := reqK8s.RbacV1().Roles(project).Create(c.Request.Context(), role, v1.CreateOptions{}); err != nil { - if errors.IsAlreadyExists(err) { - // Role exists - update it to ensure it has the latest permissions (including update/patch) - log.Printf("Role %s already exists, updating with latest permissions", roleName) - if _, err := reqK8s.RbacV1().Roles(project).Update(c.Request.Context(), role, v1.UpdateOptions{}); err != nil { - return fmt.Errorf("update Role: %w", err) - } - log.Printf("Successfully updated Role %s with annotation update permissions", roleName) - } else { - return fmt.Errorf("create Role: %w", err) - } - } - - // Bind Role to the ServiceAccount - rbName := fmt.Sprintf("ambient-session-%s-rb", sessionName) - rb := &rbacv1.RoleBinding{ - ObjectMeta: v1.ObjectMeta{ - Name: rbName, - Namespace: project, - OwnerReferences: []v1.OwnerReference{ownerRef}, - }, - RoleRef: rbacv1.RoleRef{APIGroup: "rbac.authorization.k8s.io", Kind: "Role", Name: roleName}, - Subjects: []rbacv1.Subject{{Kind: "ServiceAccount", Name: saName, Namespace: project}}, - } - if _, err := reqK8s.RbacV1().RoleBindings(project).Create(context.TODO(), rb, v1.CreateOptions{}); err != nil { - if !errors.IsAlreadyExists(err) { - return fmt.Errorf("create RoleBinding: %w", err) - } - } - - // Mint short-lived K8s ServiceAccount token for CR status updates - tr := &authnv1.TokenRequest{Spec: authnv1.TokenRequestSpec{}} - tok, err := reqK8s.CoreV1().ServiceAccounts(project).CreateToken(c.Request.Context(), saName, tr, v1.CreateOptions{}) - if err != nil { - return fmt.Errorf("mint token: %w", err) - } - k8sToken := tok.Status.Token - if strings.TrimSpace(k8sToken) == "" { - 
return fmt.Errorf("received empty token for SA %s", saName) - } - - // Only store the K8s token; GitHub tokens are minted on-demand by the runner - secretData := map[string]string{ - "k8s-token": k8sToken, - } - - // Store token in a Secret (update if exists to refresh token) - secretName := fmt.Sprintf("ambient-runner-token-%s", sessionName) - refreshedAt := time.Now().UTC().Format(time.RFC3339) - sec := &corev1.Secret{ - ObjectMeta: v1.ObjectMeta{ - Name: secretName, - Namespace: project, - Labels: map[string]string{"app": "ambient-runner-token"}, - OwnerReferences: []v1.OwnerReference{ownerRef}, - Annotations: map[string]string{ - runnerTokenRefreshedAtAnnotation: refreshedAt, - }, - }, - Type: corev1.SecretTypeOpaque, - StringData: secretData, - } - - // Try to create the secret - if _, err := reqK8s.CoreV1().Secrets(project).Create(c.Request.Context(), sec, v1.CreateOptions{}); err != nil { - if errors.IsAlreadyExists(err) { - // Secret exists - update it with fresh token - log.Printf("Updating existing secret %s with fresh token", secretName) - existing, getErr := reqK8s.CoreV1().Secrets(project).Get(c.Request.Context(), secretName, v1.GetOptions{}) - if getErr != nil { - return fmt.Errorf("get Secret for update: %w", getErr) - } - secretCopy := existing.DeepCopy() - if secretCopy.Data == nil { - secretCopy.Data = map[string][]byte{} - } - secretCopy.Data["k8s-token"] = []byte(k8sToken) - if secretCopy.Annotations == nil { - secretCopy.Annotations = map[string]string{} - } - secretCopy.Annotations[runnerTokenRefreshedAtAnnotation] = refreshedAt - if _, err := reqK8s.CoreV1().Secrets(project).Update(c.Request.Context(), secretCopy, v1.UpdateOptions{}); err != nil { - return fmt.Errorf("update Secret: %w", err) - } - log.Printf("Successfully updated secret %s with fresh token", secretName) - } else { - return fmt.Errorf("create Secret: %w", err) - } - } - - // Annotate the AgenticSession with the Secret and SA names (conflict-safe patch) - patch := map[string]interface{}{ - "metadata": map[string]interface{}{ - "annotations": map[string]string{ - "ambient-code.io/runner-token-secret": secretName, - "ambient-code.io/runner-sa": saName, - }, - }, - } - b, err := json.Marshal(patch) - if err != nil { - return fmt.Errorf("marshal patch: %w", err) - } - if _, err := reqDyn.Resource(gvr).Namespace(project).Patch(c.Request.Context(), obj.GetName(), ktypes.MergePatchType, b, v1.PatchOptions{}); err != nil { - return fmt.Errorf("annotate AgenticSession: %w", err) - } - - return nil -} - func GetSession(c *gin.Context) { project := c.GetString("project") sessionName := c.Param("sessionName") @@ -1574,6 +1399,13 @@ func RemoveRepo(c *gin.Context) { } // GetWorkflowMetadata retrieves commands and agents metadata from the active workflow +// getContentServiceName returns the ambient-content service name for a session +// Temp-content pods are deprecated - sessions must be running to access workspace +func getContentServiceName(session string) string { + return fmt.Sprintf("ambient-content-%s", session) +} + +// GetWorkflowMetadata retrieves the workflow metadata for an agentic session // GET /api/projects/:projectName/agentic-sessions/:sessionName/workflow/metadata func GetWorkflowMetadata(c *gin.Context) { project := c.GetString("project") @@ -1594,21 +1426,8 @@ func GetWorkflowMetadata(c *gin.Context) { token = c.GetHeader("X-Forwarded-Access-Token") } - // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", sessionName) - // Use 
the dependency-injected client selection function - reqK8s, _ := GetK8sClientsForRequest(c) - if reqK8s == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Temp service doesn't exist, use regular service - serviceName = fmt.Sprintf("ambient-content-%s", sessionName) - } else { - serviceName = fmt.Sprintf("ambient-content-%s", sessionName) - } + // Use ambient-content service (per-session content service) + serviceName := fmt.Sprintf("ambient-content-%s", sessionName) // Build URL to content service endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) @@ -2049,18 +1868,10 @@ func StartSession(c *gin.Context) { return } - // Check if this is a continuation (session is in a terminal phase) - isActualContinuation := false + // Log current phase for debugging if currentStatus, ok := item.Object["status"].(map[string]interface{}); ok { if phase, ok := currentStatus["phase"].(string); ok { - terminalPhases := []string{"Completed", "Failed", "Stopped", "Error"} - for _, terminalPhase := range terminalPhases { - if phase == terminalPhase { - isActualContinuation = true - log.Printf("StartSession: Detected continuation - session is in terminal phase: %s", phase) - break - } - } + log.Printf("StartSession: Current phase is %s", phase) } } @@ -2074,10 +1885,16 @@ func StartSession(c *gin.Context) { annotations["ambient-code.io/desired-phase"] = "Running" annotations["ambient-code.io/start-requested-at"] = time.Now().Format(time.RFC3339) - // For continuations, set parent-session-id so operator reuses PVC - if isActualContinuation { - annotations["vteam.ambient-code/parent-session-id"] = sessionName - log.Printf("StartSession: Continuation detected - set parent-session-id=%s for PVC reuse", sessionName) + // Clean up self-referential parent-session-id annotations. + // Old code used to set parent-session-id to the session's own name for PVC reuse, + // but this caused the runner to skip INITIAL_PROMPT thinking it was a continuation. + // With S3 storage, we don't need this anymore. Session state persists via S3 sync. + // Keep legitimate parent-session-id annotations (pointing to a DIFFERENT session). 
+ if existingParent, ok := annotations["vteam.ambient-code/parent-session-id"]; ok { + if existingParent == sessionName { + log.Printf("StartSession: Clearing self-referential parent-session-id annotation") + delete(annotations, "vteam.ambient-code/parent-session-id") + } } item.SetAnnotations(annotations) @@ -2230,109 +2047,25 @@ func StopSession(c *gin.Context) { c.JSON(http.StatusAccepted, session) } -// EnableWorkspaceAccess requests a temporary content pod for workspace access on stopped sessions +// EnableWorkspaceAccess is deprecated - temporary content pods have been removed // POST /api/projects/:projectName/agentic-sessions/:sessionName/workspace/enable func EnableWorkspaceAccess(c *gin.Context) { - project := c.GetString("project") - sessionName := c.Param("sessionName") - gvr := GetAgenticSessionV1Alpha1Resource() - - _, k8sDyn := GetK8sClientsForRequest(c) - if k8sDyn == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - - item, err := k8sDyn.Resource(gvr).Namespace(project).Get(context.TODO(), sessionName, v1.GetOptions{}) - if err != nil { - if errors.IsNotFound(err) { - c.JSON(http.StatusNotFound, gin.H{"error": "Session not found"}) - return - } - c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to get session"}) - return - } - - // Only allow for stopped/completed/failed sessions - status, _ := item.Object["status"].(map[string]interface{}) - phase, _ := status["phase"].(string) - if phase != "Stopped" && phase != "Completed" && phase != "Failed" { - c.JSON(http.StatusConflict, gin.H{"error": "Workspace access only available for stopped sessions"}) - return - } - - // Set annotation to request temp pod - annotations := item.GetAnnotations() - if annotations == nil { - annotations = make(map[string]string) - } - now := time.Now().UTC().Format(time.RFC3339) - annotations["ambient-code.io/temp-content-requested"] = "true" - annotations["ambient-code.io/temp-content-last-accessed"] = now - item.SetAnnotations(annotations) - - // Update CR - updated, err := k8sDyn.Resource(gvr).Namespace(project).Update(context.TODO(), item, v1.UpdateOptions{}) - if err != nil { - c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to enable workspace access"}) - return - } - - session := types.AgenticSession{ - APIVersion: updated.GetAPIVersion(), - Kind: updated.GetKind(), - Metadata: updated.Object["metadata"].(map[string]interface{}), - } - if spec, ok := updated.Object["spec"].(map[string]interface{}); ok { - session.Spec = parseSpec(spec) - } - if status, ok := updated.Object["status"].(map[string]interface{}); ok { - session.Status = parseStatus(status) - } - - log.Printf("EnableWorkspaceAccess: Set temp-content-requested annotation for %s", sessionName) - c.JSON(http.StatusAccepted, session) + c.JSON(http.StatusGone, gin.H{ + "error": "Temporary workspace access has been removed", + "message": "Session artifacts are now stored in S3. 
Access artifacts directly from your S3 bucket.", + "hint": "Configure S3 storage in project settings to persist session state and artifacts.", + "s3Path": fmt.Sprintf("s3://{bucket}/{namespace}/%s/", c.Param("sessionName")), + }) } // TouchWorkspaceAccess updates the last-accessed timestamp to keep temp pod alive // POST /api/projects/:projectName/agentic-sessions/:sessionName/workspace/touch func TouchWorkspaceAccess(c *gin.Context) { - project := c.GetString("project") - sessionName := c.Param("sessionName") - gvr := GetAgenticSessionV1Alpha1Resource() - - _, k8sDyn := GetK8sClientsForRequest(c) - if k8sDyn == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - - item, err := k8sDyn.Resource(gvr).Namespace(project).Get(context.TODO(), sessionName, v1.GetOptions{}) - if err != nil { - if errors.IsNotFound(err) { - c.JSON(http.StatusNotFound, gin.H{"error": "Session not found"}) - return - } - c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to get session"}) - return - } - - annotations := item.GetAnnotations() - if annotations == nil { - annotations = make(map[string]string) - } - annotations["ambient-code.io/temp-content-last-accessed"] = time.Now().UTC().Format(time.RFC3339) - item.SetAnnotations(annotations) - - if _, err := k8sDyn.Resource(gvr).Namespace(project).Update(context.TODO(), item, v1.UpdateOptions{}); err != nil { - c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to update timestamp"}) - return - } - - log.Printf("TouchWorkspaceAccess: Updated last-accessed timestamp for %s", sessionName) - c.JSON(http.StatusOK, gin.H{"message": "Workspace access timestamp updated"}) + // Deprecated: Temp-content pods no longer exist + c.JSON(http.StatusGone, gin.H{ + "error": "Temporary workspace access has been removed", + "message": "Session artifacts are stored in S3 and do not require touch/keepalive.", + }) } // GetSessionK8sResources returns job, pod, and PVC information for a session @@ -2449,64 +2182,12 @@ func GetSessionK8sResources(c *gin.Context) { } } - // Check for temp-content pod - tempPodName := fmt.Sprintf("temp-content-%s", sessionName) - tempPod, err := k8sClt.CoreV1().Pods(project).Get(c.Request.Context(), tempPodName, v1.GetOptions{}) - if err == nil { - tempPodPhase := string(tempPod.Status.Phase) - if tempPod.DeletionTimestamp != nil { - tempPodPhase = "Terminating" - } - - containerInfos := []map[string]interface{}{} - for _, cs := range tempPod.Status.ContainerStatuses { - state := "Unknown" - var exitCode *int32 - var reason string - if cs.State.Running != nil { - state = "Running" - // If pod is terminating but container still shows running, mark as terminating - if tempPod.DeletionTimestamp != nil { - state = "Terminating" - } - } else if cs.State.Terminated != nil { - state = "Terminated" - exitCode = &cs.State.Terminated.ExitCode - reason = cs.State.Terminated.Reason - } else if cs.State.Waiting != nil { - state = "Waiting" - reason = cs.State.Waiting.Reason - } - containerInfos = append(containerInfos, map[string]interface{}{ - "name": cs.Name, - "state": state, - "exitCode": exitCode, - "reason": reason, - }) - } - podInfos = append(podInfos, map[string]interface{}{ - "name": tempPod.Name, - "phase": tempPodPhase, - "containers": containerInfos, - "isTempPod": true, - }) - } - result["pods"] = podInfos - // Get PVC info - always use session's own PVC name - // Note: If session was created with parent_session_id (via API), the operator handles PVC reuse - pvcName := 
fmt.Sprintf("ambient-workspace-%s", sessionName) - pvc, err := k8sClt.CoreV1().PersistentVolumeClaims(project).Get(c.Request.Context(), pvcName, v1.GetOptions{}) - result["pvcName"] = pvcName - if err == nil { - result["pvcExists"] = true - if storage, ok := pvc.Status.Capacity[corev1.ResourceStorage]; ok { - result["pvcSize"] = storage.String() - } - } else { - result["pvcExists"] = false - } + // PVCs deprecated - sessions now use EmptyDir with S3 state persistence + result["pvcExists"] = false + result["pvcName"] = "N/A (using EmptyDir + S3)" + result["storageMode"] = "EmptyDir + S3" c.JSON(http.StatusOK, result) } @@ -2529,10 +2210,11 @@ func ListSessionWorkspace(c *gin.Context) { } rel := strings.TrimSpace(c.Query("path")) - // Build absolute workspace path using plain session (no url.PathEscape to match FS paths) - absPath := "/sessions/" + session + "/workspace" + // Path is relative to content service's StateBaseDir (which is /workspace) + // Content service handles the base path, so we just pass the relative path + absPath := "" if rel != "" { - absPath += "/" + rel + absPath = rel } // Call per-job service or temp service for completed sessions @@ -2541,19 +2223,8 @@ func ListSessionWorkspace(c *gin.Context) { token = c.GetHeader("X-Forwarded-Access-Token") } - // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) - // AuthN: require user token before probing K8s Services - k8sClt, _ := GetK8sClientsForRequest(c) - if k8sClt == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Temp service doesn't exist, use regular service - serviceName = fmt.Sprintf("ambient-content-%s", session) - } + // Use ambient-content service (per-session content service) + serviceName := fmt.Sprintf("ambient-content-%s", session) endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) u := fmt.Sprintf("%s/content/list?path=%s", endpoint, url.QueryEscape(absPath)) @@ -2615,23 +2286,15 @@ func GetSessionWorkspaceFile(c *gin.Context) { } sub := strings.TrimPrefix(c.Param("path"), "/") - absPath := "/sessions/" + session + "/workspace/" + sub + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := sub token := c.GetHeader("Authorization") if strings.TrimSpace(token) == "" { token = c.GetHeader("X-Forwarded-Access-Token") } - // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) - k8sClt, _ := GetK8sClientsForRequest(c) - if k8sClt == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } + // Use ambient-content service (per-session content service) + serviceName := fmt.Sprintf("ambient-content-%s", session) endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) u := fmt.Sprintf("%s/content/file?path=%s", endpoint, url.QueryEscape(absPath)) @@ -2692,22 +2355,22 @@ func PutSessionWorkspaceFile(c *gin.Context) { // Validate and sanitize path to prevent directory traversal // Use robust path validation that works across platforms sub := strings.TrimPrefix(c.Param("path"), "/") - 
workspaceBase := "/sessions/" + session + "/workspace" + workspaceBase := "/workspace" - // Construct absolute path using filepath.Join for proper path handling - absPath := filepath.Join(workspaceBase, sub) + // Construct absolute path using filepath.Join for path validation + validationPath := filepath.Join(workspaceBase, sub) // Use robust path validation from pathutil package // This is more secure than manual string checks and works across platforms - if !pathutil.IsPathWithinBase(absPath, workspaceBase) { - log.Printf("PutSessionWorkspaceFile: path traversal attempt detected - path=%q escapes workspace=%q", absPath, workspaceBase) + if !pathutil.IsPathWithinBase(validationPath, workspaceBase) { + log.Printf("PutSessionWorkspaceFile: path traversal attempt detected - path=%q escapes workspace=%q", validationPath, workspaceBase) c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid path: must be within workspace directory"}) return } + // Use relative path for content service (it has its own StateBaseDir=/workspace) // Convert to forward slashes for content service (expects POSIX paths) - // filepath.Join may use backslashes on Windows, but content service always uses forward slashes - absPath = filepath.ToSlash(absPath) + absPath := filepath.ToSlash(sub) token := c.GetHeader("Authorization") if strings.TrimSpace(token) == "" { @@ -2740,7 +2403,7 @@ func PutSessionWorkspaceFile(c *gin.Context) { // Verify session exists using reqDyn AFTER RBAC check // This prevents enumeration attacks - unauthorized users get same "Forbidden" response gvr := GetAgenticSessionV1Alpha1Resource() - item, err := reqDyn.Resource(gvr).Namespace(project).Get(c.Request.Context(), session, v1.GetOptions{}) + _, err = reqDyn.Resource(gvr).Namespace(project).Get(c.Request.Context(), session, v1.GetOptions{}) if err != nil { if errors.IsNotFound(err) { c.JSON(http.StatusNotFound, gin.H{"error": "Session not found"}) @@ -2750,60 +2413,15 @@ func PutSessionWorkspaceFile(c *gin.Context) { return } - // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) - serviceFound := false - + // Check if ambient-content service exists (session must be running) + serviceName := fmt.Sprintf("ambient-content-%s", session) if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Temp service doesn't exist, try regular service - serviceName = fmt.Sprintf("ambient-content-%s", session) - if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Neither service exists - need to spawn temp content pod - log.Printf("PutSessionWorkspaceFile: No content service found for session %s, requesting temp pod", session) - serviceFound = false - } else { - serviceFound = true - } - } else { - serviceFound = true - } - - // If no service exists, request temp content pod and return accepted status - // We already have the session item from the existence check above - if !serviceFound { - - // Check if temp content was already requested (avoid duplicate pod creation) - annotations := item.GetAnnotations() - if annotations != nil && annotations["ambient-code.io/temp-content-requested"] == "true" { - log.Printf("PutSessionWorkspaceFile: Temp content already requested for session %s", session) - c.JSON(http.StatusAccepted, gin.H{"message": "Content service starting, please retry upload in a few seconds"}) - return - } - - // Request temp content pod via 
annotation - if annotations == nil { - annotations = make(map[string]string) - } - now := time.Now().UTC().Format(time.RFC3339) - annotations["ambient-code.io/temp-content-requested"] = "true" - annotations["ambient-code.io/temp-content-last-accessed"] = now - item.SetAnnotations(annotations) - - // Use optimistic locking - if resource was modified between Get and Update, K8s returns conflict - if _, err := reqDyn.Resource(gvr).Namespace(project).Update(c.Request.Context(), item, v1.UpdateOptions{}); err != nil { - if errors.IsConflict(err) { - // Another request updated the resource - likely also requested temp pod - log.Printf("PutSessionWorkspaceFile: Conflict updating session %s (concurrent request), treating as already requested", session) - c.JSON(http.StatusAccepted, gin.H{"message": "Content service starting, please retry upload in a few seconds"}) - return - } - log.Printf("PutSessionWorkspaceFile: Failed to request temp pod: %v", err) - c.JSON(http.StatusServiceUnavailable, gin.H{"error": "Content service not available, please try again in a few seconds"}) - return - } - - log.Printf("PutSessionWorkspaceFile: Requested temp content pod for session %s", session) - c.JSON(http.StatusAccepted, gin.H{"message": "Content service starting, please retry upload in a few seconds"}) + // Service doesn't exist - session is not running + log.Printf("PutSessionWorkspaceFile: Content service not found for session %s (session not running)", session) + c.JSON(http.StatusConflict, gin.H{ + "error": "Session is not running. Start the session to upload files.", + "hint": "File uploads require an active session. Start the session and try again.", + }) return } @@ -2910,22 +2528,22 @@ func DeleteSessionWorkspaceFile(c *gin.Context) { // Validate and sanitize path to prevent directory traversal // Use robust path validation that works across platforms sub := strings.TrimPrefix(c.Param("path"), "/") - workspaceBase := "/sessions/" + session + "/workspace" + workspaceBase := "/workspace" - // Construct absolute path using filepath.Join for proper path handling - absPath := filepath.Join(workspaceBase, sub) + // Construct absolute path using filepath.Join for path validation + validationPath := filepath.Join(workspaceBase, sub) // Use robust path validation from pathutil package // This is more secure than manual string checks and works across platforms - if !pathutil.IsPathWithinBase(absPath, workspaceBase) { - log.Printf("DeleteSessionWorkspaceFile: path traversal attempt detected - path=%q escapes workspace=%q", absPath, workspaceBase) + if !pathutil.IsPathWithinBase(validationPath, workspaceBase) { + log.Printf("DeleteSessionWorkspaceFile: path traversal attempt detected - path=%q escapes workspace=%q", validationPath, workspaceBase) c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid path: must be within workspace directory"}) return } + // Use relative path for content service (it has its own StateBaseDir=/workspace) // Convert to forward slashes for content service (expects POSIX paths) - // filepath.Join may use backslashes on Windows, but content service always uses forward slashes - absPath = filepath.ToSlash(absPath) + absPath := filepath.ToSlash(sub) token := c.GetHeader("Authorization") if strings.TrimSpace(token) == "" { @@ -2968,26 +2586,11 @@ func DeleteSessionWorkspaceFile(c *gin.Context) { return } - // Try temp service first, then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) - serviceFound := false - + // Check if content service exists (session must be 
running) + serviceName := getContentServiceName(session) if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Temp service doesn't exist, try regular service - serviceName = fmt.Sprintf("ambient-content-%s", session) - if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - log.Printf("DeleteSessionWorkspaceFile: No content service found for session %s", session) - c.JSON(http.StatusServiceUnavailable, gin.H{"error": "Content service not available"}) - return - } else { - serviceFound = true - } - } else { - serviceFound = true - } - - if !serviceFound { - c.JSON(http.StatusServiceUnavailable, gin.H{"error": "Content service not available"}) + log.Printf("DeleteSessionWorkspaceFile: Content service not found for session %s (session not running)", session) + c.JSON(http.StatusConflict, gin.H{"error": "Session is not running. Start the session to access files."}) return } @@ -3060,16 +2663,13 @@ func PushSessionRepo(c *gin.Context) { log.Printf("pushSessionRepo: request project=%s session=%s repoIndex=%d commitLen=%d", project, session, body.RepoIndex, len(strings.TrimSpace(body.CommitMessage))) // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, k8sDyn := GetK8sClientsForRequest(c) if k8sClt == nil || k8sDyn == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) log.Printf("pushSessionRepo: using service %s", serviceName) @@ -3092,11 +2692,12 @@ func PushSessionRepo(c *gin.Context) { } rm, _ := repos[body.RepoIndex].(map[string]interface{}) // Derive repoPath from input URL folder name + // Paths are relative to content service's StateBaseDir (which is /workspace) if in, ok := rm["input"].(map[string]interface{}); ok { if urlv, ok2 := in["url"].(string); ok2 && strings.TrimSpace(urlv) != "" { folder := DeriveRepoFolderFromURL(strings.TrimSpace(urlv)) if folder != "" { - resolvedRepoPath = fmt.Sprintf("/sessions/%s/workspace/%s", session, folder) + resolvedRepoPath = folder } } } @@ -3113,9 +2714,9 @@ func PushSessionRepo(c *gin.Context) { // If input URL missing or unparsable, fall back to numeric index path (last resort) if strings.TrimSpace(resolvedRepoPath) == "" { if body.RepoIndex >= 0 { - resolvedRepoPath = fmt.Sprintf("/sessions/%s/workspace/%d", session, body.RepoIndex) + resolvedRepoPath = fmt.Sprintf("%d", body.RepoIndex) } else { - resolvedRepoPath = fmt.Sprintf("/sessions/%s/workspace", session) + resolvedRepoPath = "" } } if strings.TrimSpace(resolvedOutputURL) == "" { @@ -3229,24 +2830,21 @@ func AbandonSessionRepo(c *gin.Context) { } // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = 
fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) log.Printf("AbandonSessionRepo: using service %s", serviceName) repoPath := strings.TrimSpace(body.RepoPath) if repoPath == "" { if body.RepoIndex >= 0 { - repoPath = fmt.Sprintf("/sessions/%s/workspace/%d", session, body.RepoIndex) + repoPath = fmt.Sprintf("%d", body.RepoIndex) } else { - repoPath = fmt.Sprintf("/sessions/%s/workspace", session) + repoPath = "" } } payload := map[string]interface{}{ @@ -3302,8 +2900,9 @@ func DiffSessionRepo(c *gin.Context) { session := c.Param("sessionName") repoIndexStr := strings.TrimSpace(c.Query("repoIndex")) repoPath := strings.TrimSpace(c.Query("repoPath")) + // Paths are relative to content service's StateBaseDir (which is /workspace) if repoPath == "" && repoIndexStr != "" { - repoPath = fmt.Sprintf("/sessions/%s/workspace/%s", session, repoIndexStr) + repoPath = repoIndexStr } if repoPath == "" { c.JSON(http.StatusBadRequest, gin.H{"error": "missing repoPath/repoIndex"}) @@ -3311,16 +2910,13 @@ func DiffSessionRepo(c *gin.Context) { } // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) log.Printf("DiffSessionRepo: using service %s", serviceName) url := fmt.Sprintf("%s/content/github/diff?repoPath=%s", endpoint, url.QueryEscape(repoPath)) @@ -3372,20 +2968,17 @@ func GetGitStatus(c *gin.Context) { return } - // Build absolute path - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, relativePath) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := relativePath // Get content service endpoint - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-status?path=%s", serviceName, project, url.QueryEscape(absPath)) @@ -3443,20 +3036,17 @@ func ConfigureGitRemote(c *gin.Context) { body.Branch = "main" } - // Build absolute path - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", sessionName, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path // Get content service endpoint - serviceName := fmt.Sprintf("temp-content-%s", sessionName) + serviceName := getContentServiceName(sessionName) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", sessionName) - } endpoint := 
fmt.Sprintf("http://%s.%s.svc:8080/content/git-configure-remote", serviceName, project) @@ -3560,20 +3150,17 @@ func SynchronizeGit(c *gin.Context) { body.Message = fmt.Sprintf("Session %s - %s", session, time.Now().Format(time.RFC3339)) } - // Build absolute path - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path // Get content service endpoint - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-sync", serviceName, project) @@ -3630,18 +3217,16 @@ func GetGitMergeStatus(c *gin.Context) { branch = "main" } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, relativePath) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := relativePath - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-merge-status?path=%s&branch=%s", serviceName, project, url.QueryEscape(absPath), url.QueryEscape(branch)) @@ -3690,18 +3275,16 @@ func GitPullSession(c *gin.Context) { body.Branch = "main" } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-pull", serviceName, project) @@ -3769,18 +3352,16 @@ func GitPushSession(c *gin.Context) { body.Message = fmt.Sprintf("Session %s artifacts", session) } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-push", serviceName, project) @@ -3842,18 +3423,16 @@ func GitCreateBranchSession(c *gin.Context) { 
body.Path = "artifacts" } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-create-branch", serviceName, project) @@ -3905,18 +3484,16 @@ func GitListBranchesSession(c *gin.Context) { relativePath = "artifacts" } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, relativePath) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := relativePath - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-list-branches?path=%s", serviceName, project, url.QueryEscape(absPath)) diff --git a/components/backend/server/server.go b/components/backend/server/server.go index a6465e055..5e05caba4 100644 --- a/components/backend/server/server.go +++ b/components/backend/server/server.go @@ -2,11 +2,15 @@ package server import ( + "context" "fmt" "log" "net/http" "os" + "os/signal" "strings" + "syscall" + "time" "github.com/gin-contrib/cors" "github.com/gin-gonic/gin" @@ -95,7 +99,7 @@ func forwardedIdentityMiddleware() gin.HandlerFunc { } } -// RunContentService starts the server in content service mode +// RunContentService starts the server in content service mode with graceful shutdown func RunContentService(registerContentRoutes RouterFunc) error { r := gin.New() r.Use(gin.Recovery()) @@ -124,9 +128,39 @@ func RunContentService(registerContentRoutes RouterFunc) error { if port == "" { port = "8080" } - log.Printf("Content service starting on port %s", port) - if err := r.Run(":" + port); err != nil { - return fmt.Errorf("failed to start content service: %v", err) + + // Create HTTP server for graceful shutdown + srv := &http.Server{ + Addr: ":" + port, + Handler: r, + } + + // Channel to receive shutdown signal + quit := make(chan os.Signal, 1) + signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM) + + // Start server in goroutine + go func() { + log.Printf("Content service starting on port %s", port) + if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed { + log.Fatalf("Content service listen error: %v", err) + } + }() + + // Wait for shutdown signal + sig := <-quit + log.Printf("Content service received signal %v, shutting down gracefully...", sig) + + // Create shutdown context with timeout + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + // Attempt graceful shutdown + if err := srv.Shutdown(ctx); err != nil { + log.Printf("Content service forced to shutdown: %v", err) + return err } + + log.Println("Content service shutdown complete") 
return nil } diff --git a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx index 1776504c9..da3c28e76 100644 --- a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx +++ b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx @@ -90,7 +90,6 @@ import { useSession, useStopSession, useDeleteSession, - useSessionK8sResources, useContinueSession, } from "@/services/queries"; import { @@ -192,10 +191,6 @@ export default function ProjectSessionDetailPage({ error, refetch: refetchSession, } = useSession(projectName, sessionName); - const { data: k8sResources } = useSessionK8sResources( - projectName, - sessionName, - ); const stopMutation = useStopSession(); const deleteMutation = useDeleteSession(); const continueMutation = useContinueSession(); @@ -1256,9 +1251,6 @@ export default function ProjectSessionDetailPage({ ); }; - // Duration calculation removed - startTime/completionTime no longer in status - const durationMs = undefined; - // Loading state if (isLoading || !projectName || !sessionName) { return ( @@ -1383,9 +1375,6 @@ export default function ProjectSessionDetailPage({ onStop={handleStop} onContinue={handleContinue} onDelete={handleDelete} - durationMs={durationMs} - k8sResources={k8sResources} - messageCount={aguiState.messages.length} renderMode="kebab-only" /> diff --git a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/session-header.tsx b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/session-header.tsx index 0f23916cd..985e49d87 100644 --- a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/session-header.tsx +++ b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/session-header.tsx @@ -19,12 +19,6 @@ type SessionHeaderProps = { onStop: () => void; onContinue: () => void; onDelete: () => void; - durationMs?: number; - k8sResources?: { - pvcName?: string; - pvcSize?: string; - }; - messageCount: number; renderMode?: 'full' | 'actions-only' | 'kebab-only'; }; @@ -36,9 +30,6 @@ export function SessionHeader({ onStop, onContinue, onDelete, - durationMs, - k8sResources, - messageCount, renderMode = 'full', }: SessionHeaderProps) { const [detailsModalOpen, setDetailsModalOpen] = useState(false); @@ -146,9 +137,6 @@ export function SessionHeader({ projectName={projectName} open={detailsModalOpen} onOpenChange={setDetailsModalOpen} - durationMs={durationMs} - k8sResources={k8sResources} - messageCount={messageCount} /> 0) { - const remainingMinutes = minutes % 60; - const remainingSeconds = seconds % 60; - return `${hours}h ${remainingMinutes}m ${remainingSeconds}s`; - } else if (minutes > 0) { - const remainingSeconds = seconds % 60; - return `${minutes}m ${remainingSeconds}s`; - } else { - return `${seconds}s`; - } -} - type SessionDetailsModalProps = { session: AgenticSession; projectName: string; open: boolean; onOpenChange: (open: boolean) => void; - durationMs?: number; - k8sResources?: { - pvcName?: string; - pvcSize?: string; - }; - messageCount: number; }; export function SessionDetailsModal({ @@ -45,9 +23,6 @@ export function SessionDetailsModal({ projectName, open, onOpenChange, - durationMs, - k8sResources, - messageCount, }: SessionDetailsModalProps) { const [exportingAgui, setExportingAgui] = useState(false); const [exportingLegacy, setExportingLegacy] = useState(false); @@ -113,44 +88,6 @@ export function SessionDetailsModal({ 
{session.spec.llmSettings.model} -
- Temperature: - {session.spec.llmSettings.temperature} -
- -
- Mode: - {session.spec?.interactive ? "Interactive" : "Headless"} -
- - {/* startTime removed from simplified status */} - -
- Duration: - {typeof durationMs === "number" ? formatDuration(durationMs) : "-"} -
- - {k8sResources?.pvcName && ( -
- PVC: - {k8sResources.pvcName} -
- )} - - {k8sResources?.pvcSize && ( -
- PVC Size: - {k8sResources.pvcSize} -
- )} - - {/* jobName removed from simplified status */} - -
- Messages: - {messageCount} -
- {/* Export buttons */}
{loadingExport ? ( @@ -210,25 +147,42 @@ export function SessionDetailsModal({ {session.status?.conditions && session.status.conditions.length > 0 && (
Reconciliation Conditions
-
+ {session.status.conditions.map((condition, index) => ( -
-
- {condition.type} - - {condition.status} - -
-
{condition.reason || "No reason provided"}
- {condition.message && ( -
{condition.message}
- )} - {condition.lastTransitionTime && ( -
Updated {new Date(condition.lastTransitionTime).toLocaleString()}
- )} -
+ + +
+ {condition.type} + + {condition.status} + +
+
+ +
+
+ Reason: + {condition.reason || "No reason provided"} +
+ {condition.message && ( +
+ Message: +

{condition.message}

+
+ )} + {condition.lastTransitionTime && ( +
+ Updated {new Date(condition.lastTransitionTime).toLocaleString()} +
+ )} +
+
+
))} -
+
)}
diff --git a/components/frontend/src/components/workspace-sections/settings-section.tsx b/components/frontend/src/components/workspace-sections/settings-section.tsx index 607da5ef8..f69ec9e4a 100644 --- a/components/frontend/src/components/workspace-sections/settings-section.tsx +++ b/components/frontend/src/components/workspace-sections/settings-section.tsx @@ -38,11 +38,19 @@ export function SettingsSection({ projectName }: SettingsSectionProps) { const [gitlabToken, setGitlabToken] = useState(""); const [gitlabInstanceUrl, setGitlabInstanceUrl] = useState(""); const [showGitlabToken, setShowGitlabToken] = useState(false); + const [storageMode, setStorageMode] = useState<"shared" | "custom">("shared"); + const [s3Endpoint, setS3Endpoint] = useState(""); + const [s3Bucket, setS3Bucket] = useState(""); + const [s3Region, setS3Region] = useState("us-east-1"); + const [s3AccessKey, setS3AccessKey] = useState(""); + const [s3SecretKey, setS3SecretKey] = useState(""); + const [showS3SecretKey, setShowS3SecretKey] = useState(false); const [anthropicExpanded, setAnthropicExpanded] = useState(false); const [githubExpanded, setGithubExpanded] = useState(false); const [jiraExpanded, setJiraExpanded] = useState(false); const [gitlabExpanded, setGitlabExpanded] = useState(false); - const FIXED_KEYS = useMemo(() => ["ANTHROPIC_API_KEY","GIT_USER_NAME","GIT_USER_EMAIL","GITHUB_TOKEN","JIRA_URL","JIRA_PROJECT","JIRA_EMAIL","JIRA_API_TOKEN","GITLAB_TOKEN","GITLAB_INSTANCE_URL"] as const, []); + const [s3Expanded, setS3Expanded] = useState(false); + const FIXED_KEYS = useMemo(() => ["ANTHROPIC_API_KEY","GIT_USER_NAME","GIT_USER_EMAIL","GITHUB_TOKEN","JIRA_URL","JIRA_PROJECT","JIRA_EMAIL","JIRA_API_TOKEN","GITLAB_TOKEN","GITLAB_INSTANCE_URL","STORAGE_MODE","S3_ENDPOINT","S3_BUCKET","S3_REGION","S3_ACCESS_KEY","S3_SECRET_KEY"] as const, []); // React Query hooks const { data: project, isLoading: projectLoading } = useProject(projectName); @@ -75,6 +83,14 @@ export function SettingsSection({ projectName }: SettingsSectionProps) { setJiraToken(byKey["JIRA_API_TOKEN"] || ""); setGitlabToken(byKey["GITLAB_TOKEN"] || ""); setGitlabInstanceUrl(byKey["GITLAB_INSTANCE_URL"] || ""); + // Determine storage mode: "custom" if S3_ENDPOINT is set, otherwise "shared" (default) + const hasCustomS3 = byKey["STORAGE_MODE"] === "custom" || (byKey["S3_ENDPOINT"] && byKey["S3_ENDPOINT"] !== ""); + setStorageMode(hasCustomS3 ? 
"custom" : "shared"); + setS3Endpoint(byKey["S3_ENDPOINT"] || ""); + setS3Bucket(byKey["S3_BUCKET"] || ""); + setS3Region(byKey["S3_REGION"] || "us-east-1"); + setS3AccessKey(byKey["S3_ACCESS_KEY"] || ""); + setS3SecretKey(byKey["S3_SECRET_KEY"] || ""); setSecrets(allSecrets.filter(s => !FIXED_KEYS.includes(s.key as typeof FIXED_KEYS[number]))); } }, [runnerSecrets, integrationSecrets, FIXED_KEYS]); @@ -147,6 +163,18 @@ export function SettingsSection({ projectName }: SettingsSectionProps) { if (jiraToken) integrationData["JIRA_API_TOKEN"] = jiraToken; if (gitlabToken) integrationData["GITLAB_TOKEN"] = gitlabToken; if (gitlabInstanceUrl) integrationData["GITLAB_INSTANCE_URL"] = gitlabInstanceUrl; + + // S3 Storage configuration + integrationData["STORAGE_MODE"] = storageMode; + if (storageMode === "custom") { + // Only save custom S3 settings when custom mode is selected + if (s3Endpoint) integrationData["S3_ENDPOINT"] = s3Endpoint; + if (s3Bucket) integrationData["S3_BUCKET"] = s3Bucket; + if (s3Region) integrationData["S3_REGION"] = s3Region; + if (s3AccessKey) integrationData["S3_ACCESS_KEY"] = s3AccessKey; + if (s3SecretKey) integrationData["S3_SECRET_KEY"] = s3SecretKey; + } + // If shared mode: backend will use operator defaults + minio-credentials secret for (const { key, value } of secrets) { if (!key) continue; if (FIXED_KEYS.includes(key as typeof FIXED_KEYS[number])) continue; @@ -468,6 +496,137 @@ export function SettingsSection({ projectName }: SettingsSectionProps) { )} + {/* S3 Storage Configuration Section */} +
+
setS3Expanded((v) => !v)} + > +
+ +
Configure S3-compatible storage for session artifacts and state
+
+ {s3Expanded ? : } +
+ {s3Expanded && ( +
+ + + Session State Storage + + Session artifacts, uploads, and Claude history are persisted to S3-compatible storage. By default, the cluster provides shared MinIO storage. + + +
+ +
+
+ setStorageMode("shared")} + className="h-4 w-4" + /> + +
+
+ Automatically uses in-cluster MinIO. No configuration needed. +
+
+
+
+ setStorageMode("custom")} + className="h-4 w-4" + /> + +
+
+ Configure AWS S3, external MinIO, or other S3-compatible endpoint. +
+
+
+ {storageMode === "custom" && ( + <> +
+ +
S3-compatible endpoint (e.g., https://s3.amazonaws.com, http://minio.local:9000)
+ setS3Endpoint(e.target.value)} + /> +
+
+ +
Bucket name for session storage
+ setS3Bucket(e.target.value)} + /> +
+
+ +
AWS region (optional, default: us-east-1)
+ setS3Region(e.target.value)} + /> +
+
+ +
S3 access key ID
+ setS3AccessKey(e.target.value)} + /> +
+
+ +
S3 secret access key
+
+ setS3SecretKey(e.target.value)} + className="flex-1" + /> + +
+
+ + )} +
+ )} +
+ {/* Custom Environment Variables Section */}
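Editor's note: the STORAGE_MODE / S3_* keys saved above are consumed on the operator side by getS3ConfigForProject (see the operator sessions.go hunks later in this patch). Below is a minimal sketch of the assumed resolution order, with shared-mode defaults coming from the operator's S3_ENDPOINT/S3_BUCKET env and the minio-credentials Secret; the names, types, and signature here are illustrative assumptions, not the verified implementation.

package main

import "fmt"

// S3Defaults loosely mirrors the operator-level defaults (S3_ENDPOINT / S3_BUCKET env plus
// the minio-credentials Secret); the field names are assumptions for this sketch.
type S3Defaults struct {
	Endpoint, Bucket, AccessKey, SecretKey string
}

// resolveS3Config sketches the assumed behaviour of getS3ConfigForProject: the project's
// integration Secret wins when STORAGE_MODE=custom and an endpoint is set, otherwise fall
// back to the shared in-cluster MinIO defaults; error when neither is available.
func resolveS3Config(projectSecret map[string]string, shared S3Defaults) (endpoint, bucket, accessKey, secretKey string, err error) {
	if projectSecret["STORAGE_MODE"] == "custom" && projectSecret["S3_ENDPOINT"] != "" {
		return projectSecret["S3_ENDPOINT"], projectSecret["S3_BUCKET"],
			projectSecret["S3_ACCESS_KEY"], projectSecret["S3_SECRET_KEY"], nil
	}
	if shared.Endpoint == "" || shared.Bucket == "" {
		return "", "", "", "", fmt.Errorf("no S3 storage configured for project")
	}
	return shared.Endpoint, shared.Bucket, shared.AccessKey, shared.SecretKey, nil
}

func main() {
	// Shared mode: no per-project overrides, so the operator defaults win.
	ep, bucket, _, _, err := resolveS3Config(map[string]string{}, S3Defaults{
		Endpoint: "http://minio.ambient-code.svc:9000",
		Bucket:   "ambient-sessions",
	})
	fmt.Println(ep, bucket, err)
}

In custom mode the same call would instead return the five keys written by this settings panel.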
diff --git a/components/frontend/src/types/project-settings.ts b/components/frontend/src/types/project-settings.ts index ccb9ebd0f..d0aff5cd9 100644 --- a/components/frontend/src/types/project-settings.ts +++ b/components/frontend/src/types/project-settings.ts @@ -4,11 +4,19 @@ export type LLMSettings = { maxTokens: number; }; +export type S3StorageConfig = { + enabled: boolean; + endpoint: string; + bucket: string; + region?: string; +}; + export type ProjectDefaultSettings = { llmSettings: LLMSettings; defaultTimeout: number; allowedWebsiteDomains?: string[]; maxConcurrentSessions: number; + s3Storage?: S3StorageConfig; }; export type ProjectResourceLimits = { diff --git a/components/manifests/base/kustomization.yaml b/components/manifests/base/kustomization.yaml index 58c3c658b..e35dc92cb 100644 --- a/components/manifests/base/kustomization.yaml +++ b/components/manifests/base/kustomization.yaml @@ -13,6 +13,7 @@ resources: - frontend-deployment.yaml - operator-deployment.yaml - workspace-pvc.yaml +- minio-deployment.yaml # Default images (can be overridden by overlays) images: @@ -24,4 +25,6 @@ images: newTag: latest - name: quay.io/ambient_code/vteam_claude_runner newTag: latest +- name: quay.io/ambient_code/vteam_state_sync + newTag: latest diff --git a/components/manifests/base/minio-credentials-secret.yaml.example b/components/manifests/base/minio-credentials-secret.yaml.example new file mode 100644 index 000000000..58472d078 --- /dev/null +++ b/components/manifests/base/minio-credentials-secret.yaml.example @@ -0,0 +1,31 @@ +apiVersion: v1 +kind: Secret +metadata: + name: minio-credentials +type: Opaque +stringData: + # MinIO root credentials + # Change these values in production! + root-user: "admin" + root-password: "changeme123" + + # For use in project settings (same credentials for convenience) + access-key: "admin" + secret-key: "changeme123" +--- +# Instructions: +# 1. Copy this file to minio-credentials-secret.yaml +# 2. Change root-user and root-password to secure values +# 3. Apply: kubectl apply -f minio-credentials-secret.yaml -n ambient-code +# +# After MinIO is running: +# 1. Access MinIO console: kubectl port-forward svc/minio 9001:9001 -n ambient-code +# 2. Open http://localhost:9001 in browser +# 3. Login with root-user/root-password +# 4. Create bucket: "ambient-sessions" +# 5. 
Configure bucket in project settings: +# - S3_ENDPOINT: http://minio.ambient-code.svc:9000 +# - S3_BUCKET: ambient-sessions +# - S3_ACCESS_KEY: {your-root-user} +# - S3_SECRET_KEY: {your-root-password} + diff --git a/components/manifests/base/minio-deployment.yaml b/components/manifests/base/minio-deployment.yaml new file mode 100644 index 000000000..f537d4d74 --- /dev/null +++ b/components/manifests/base/minio-deployment.yaml @@ -0,0 +1,102 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: minio-data + labels: + app: minio +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 50Gi +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: minio + labels: + app: minio +spec: + replicas: 1 + selector: + matchLabels: + app: minio + template: + metadata: + labels: + app: minio + spec: + containers: + - name: minio + image: quay.io/minio/minio:latest + args: + - server + - /data + - --console-address + - ":9001" + env: + - name: MINIO_ROOT_USER + valueFrom: + secretKeyRef: + name: minio-credentials + key: root-user + - name: MINIO_ROOT_PASSWORD + valueFrom: + secretKeyRef: + name: minio-credentials + key: root-password + ports: + - containerPort: 9000 + name: api + protocol: TCP + - containerPort: 9001 + name: console + protocol: TCP + volumeMounts: + - name: data + mountPath: /data + livenessProbe: + httpGet: + path: /minio/health/live + port: 9000 + initialDelaySeconds: 30 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /minio/health/ready + port: 9000 + initialDelaySeconds: 10 + periodSeconds: 5 + resources: + requests: + cpu: 250m + memory: 512Mi + limits: + cpu: 1000m + memory: 2Gi + volumes: + - name: data + persistentVolumeClaim: + claimName: minio-data +--- +apiVersion: v1 +kind: Service +metadata: + name: minio + labels: + app: minio +spec: + type: ClusterIP + ports: + - port: 9000 + targetPort: 9000 + protocol: TCP + name: api + - port: 9001 + targetPort: 9001 + protocol: TCP + name: console + selector: + app: minio + diff --git a/components/manifests/base/operator-deployment.yaml b/components/manifests/base/operator-deployment.yaml index fe8d38056..fe6a7b08e 100644 --- a/components/manifests/base/operator-deployment.yaml +++ b/components/manifests/base/operator-deployment.yaml @@ -19,7 +19,21 @@ spec: - name: agentic-operator image: quay.io/ambient_code/vteam_operator:latest imagePullPolicy: Always + args: + # Controller-runtime configuration + - --max-concurrent-reconciles=10 # Process up to 10 sessions in parallel + - --health-probe-bind-address=:8081 + - --leader-elect=false # Enable for HA deployments with replicas > 1 + # Uncomment for debugging with legacy watch-based implementation: + # - --legacy-watch + ports: + - containerPort: 8081 + name: health + protocol: TCP env: + # Controller concurrency (can be overridden via args) + - name: MAX_CONCURRENT_RECONCILES + value: "10" - name: NAMESPACE valueFrom: fieldRef: @@ -35,7 +49,7 @@ spec: - name: CONTENT_SERVICE_IMAGE value: "quay.io/ambient_code/vteam_backend:latest" - name: IMAGE_PULL_POLICY - value: "Always" + value: "IfNotPresent" # Vertex AI configuration from ConfigMap - name: CLAUDE_CODE_USE_VERTEX valueFrom: @@ -96,6 +110,20 @@ spec: name: google-workflow-app-secret key: GOOGLE_OAUTH_CLIENT_SECRET optional: true + # S3 state sync configuration (defaults - can be overridden per-project in settings) + - name: STATE_SYNC_IMAGE + value: "quay.io/ambient_code/vteam_state_sync:latest" + - name: S3_ENDPOINT + value: "http://minio.ambient-code.svc:9000" # In-cluster MinIO 
(change for external S3) + - name: S3_BUCKET + value: "ambient-sessions" # Create this bucket in MinIO console + # OpenTelemetry configuration + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: "otel-collector.ambient-code.svc:4317" # Deploy OTel collector separately + - name: DEPLOYMENT_ENV + value: "production" + - name: VERSION + value: "latest" # Override with actual version in production resources: requests: cpu: 50m @@ -104,11 +132,15 @@ spec: cpu: 200m memory: 256Mi livenessProbe: - exec: - command: - - /bin/sh - - -c - - "ps aux | grep '[o]perator' || exit 1" - initialDelaySeconds: 30 + httpGet: + path: /healthz + port: health + initialDelaySeconds: 15 + periodSeconds: 20 + readinessProbe: + httpGet: + path: /readyz + port: health + initialDelaySeconds: 5 periodSeconds: 10 restartPolicy: Always diff --git a/components/manifests/base/rbac/operator-clusterrole.yaml b/components/manifests/base/rbac/operator-clusterrole.yaml index e5a6b97ae..6d19ba779 100644 --- a/components/manifests/base/rbac/operator-clusterrole.yaml +++ b/components/manifests/base/rbac/operator-clusterrole.yaml @@ -25,10 +25,10 @@ rules: - apiGroups: ["batch"] resources: ["jobs"] verbs: ["get", "list", "watch", "create", "delete"] -# Pods (for getting logs from failed jobs and cleanup on stop) +# Pods (create runner pods directly, get logs, and cleanup on stop) - apiGroups: [""] resources: ["pods"] - verbs: ["get", "list", "watch", "delete", "deletecollection"] + verbs: ["get", "list", "watch", "create", "delete", "deletecollection"] - apiGroups: [""] resources: ["pods/log"] verbs: ["get"] diff --git a/components/manifests/deploy.sh b/components/manifests/deploy.sh index ba0a3ba90..c3f33eb3a 100755 --- a/components/manifests/deploy.sh +++ b/components/manifests/deploy.sh @@ -133,6 +133,7 @@ DEFAULT_BACKEND_IMAGE="${DEFAULT_BACKEND_IMAGE:-${CONTAINER_REGISTRY}/vteam_back DEFAULT_FRONTEND_IMAGE="${DEFAULT_FRONTEND_IMAGE:-${CONTAINER_REGISTRY}/vteam_frontend:${IMAGE_TAG}}" DEFAULT_OPERATOR_IMAGE="${DEFAULT_OPERATOR_IMAGE:-${CONTAINER_REGISTRY}/vteam_operator:${IMAGE_TAG}}" DEFAULT_RUNNER_IMAGE="${DEFAULT_RUNNER_IMAGE:-${CONTAINER_REGISTRY}/vteam_claude_runner:${IMAGE_TAG}}" +DEFAULT_STATE_SYNC_IMAGE="${DEFAULT_STATE_SYNC_IMAGE:-${CONTAINER_REGISTRY}/vteam_state_sync:${IMAGE_TAG}}" # Content service image (defaults to same as backend, but can be overridden) CONTENT_SERVICE_IMAGE="${CONTENT_SERVICE_IMAGE:-${DEFAULT_BACKEND_IMAGE}}" @@ -233,6 +234,7 @@ echo -e "Backend Image: ${GREEN}${DEFAULT_BACKEND_IMAGE}${NC}" echo -e "Frontend Image: ${GREEN}${DEFAULT_FRONTEND_IMAGE}${NC}" echo -e "Operator Image: ${GREEN}${DEFAULT_OPERATOR_IMAGE}${NC}" echo -e "Runner Image: ${GREEN}${DEFAULT_RUNNER_IMAGE}${NC}" +echo -e "State Sync Image: ${GREEN}${DEFAULT_STATE_SYNC_IMAGE}${NC}" echo -e "Content Service Image: ${GREEN}${CONTENT_SERVICE_IMAGE}${NC}" echo "" @@ -305,6 +307,7 @@ kustomize edit set image quay.io/ambient_code/vteam_backend:latest=${DEFAULT_BAC kustomize edit set image quay.io/ambient_code/vteam_frontend:latest=${DEFAULT_FRONTEND_IMAGE} kustomize edit set image quay.io/ambient_code/vteam_operator:latest=${DEFAULT_OPERATOR_IMAGE} kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=${DEFAULT_RUNNER_IMAGE} +kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=${DEFAULT_STATE_SYNC_IMAGE} # Build and apply manifests echo -e "${BLUE}Building and applying manifests...${NC}" @@ -428,6 +431,7 @@ kustomize edit set image quay.io/ambient_code/vteam_backend:latest=quay.io/ambie kustomize edit set 
image quay.io/ambient_code/vteam_frontend:latest=quay.io/ambient_code/vteam_frontend:latest kustomize edit set image quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:latest kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:latest +kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:latest cd ../.. echo -e "${GREEN}🎯 Ready to create RFE workflows with multi-agent collaboration!${NC}" diff --git a/components/manifests/observability/README.md b/components/manifests/observability/README.md new file mode 100644 index 000000000..1513a8eb2 --- /dev/null +++ b/components/manifests/observability/README.md @@ -0,0 +1,191 @@ +# Observability Stack for Ambient Code Platform + +Observability for OpenShift using **User Workload Monitoring** (no dedicated Prometheus needed). + +## Architecture + +``` +Operator (OTel SDK) → OTel Collector → OpenShift Prometheus + ↓ + OpenShift Console + ↓ + Grafana (optional) +``` + +## Quick Start + +### Deploy Base Stack + +```bash +# From repository root +make deploy-observability + +# Or manually +kubectl apply -k components/manifests/observability/ +``` + +**What you get**: OTel Collector + ServiceMonitor (128MB) + +### View Metrics + +Open **OpenShift Console → Observe → Metrics** and query: +- `ambient_sessions_total` +- `ambient_session_startup_duration_bucket` +- `ambient_session_errors` + +--- + +## Optional: Add Grafana + +If you want custom dashboards: + +```bash +# Add Grafana overlay +kubectl apply -k components/manifests/observability/overlays/with-grafana/ +``` + +**Adds**: Grafana (additional 128MB) - still uses OpenShift Prometheus + +**Access Grafana**: +```bash +# Create route +oc create route edge grafana --service=grafana -n ambient-code + +# Get URL +oc get route grafana -n ambient-code -o jsonpath='{.spec.host}' +# Login: admin/admin +``` + +**Import dashboard**: Upload `dashboards/ambient-operator-dashboard.json` in Grafana UI + +--- + +## Components + +| Component | What It Does | Resource Usage | +|-----------|--------------|----------------| +| **OTel Collector** | Receives metrics from operator, exports to Prometheus format | 128MB RAM | +| **ServiceMonitor** | Tells OpenShift Prometheus to scrape OTel Collector | None | +| **Grafana** (optional) | Custom dashboards | 128MB RAM, 5GB storage | + +## Metrics Available + +All metrics are prefixed with `ambient_`: + +| Metric | Type | Description | Alert Threshold | +|--------|------|-------------|-----------------| +| `ambient_session_startup_duration` | Histogram | Time from creation to Running phase | p95 > 60s | +| `ambient_session_phase_transitions` | Counter | Phase transition events | - | +| `ambient_sessions_total` | Counter | Total sessions created | Sudden spikes | +| `ambient_sessions_completed` | Counter | Sessions that reached terminal states | - | +| `ambient_reconcile_duration` | Histogram | Reconciliation loop performance | p95 > 10s | +| `ambient_pod_creation_duration` | Histogram | Time to create runner pods | p95 > 30s | +| `ambient_token_provision_duration` | Histogram | Token provisioning time | p95 > 5s | +| `ambient_session_errors` | Counter | Errors during reconciliation | Rate > 0.1/s | + +## Accessing Components + +### OpenShift Console (Options 1 & 2) + +Navigate to **Observe → Metrics** and query: + +```promql +# Total sessions created +ambient_sessions_total + +# Session creation rate +rate(ambient_sessions_total[5m]) + 
+# p95 startup time +histogram_quantile(0.95, rate(ambient_session_startup_duration_bucket[5m])) + +# Error rate by namespace +sum by (namespace) (rate(ambient_session_errors[5m])) +``` + +### OTel Collector Logs + +```bash +kubectl logs -n ambient-code -l app=otel-collector -f +``` + +## Production Setup + +### Enable OpenShift User Workload Monitoring + +Check if enabled: +```bash +oc -n openshift-user-workload-monitoring get pod +``` + +If not: +```bash +oc apply -f - < 0 || (job.Status.Succeeded == 0 && job.Status.Failed == 0) { - log.Printf("Job %s is still active, cleaning up job and pods", jobName) - - // First, delete the job itself with foreground propagation - deletePolicy := v1.DeletePropagationForeground - err = config.K8sClient.BatchV1().Jobs(sessionNamespace).Delete(context.TODO(), jobName, v1.DeleteOptions{ - PropagationPolicy: &deletePolicy, - }) - if err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete job %s: %v", jobName, err) - } else { - log.Printf("Successfully deleted job %s for stopped session", jobName) - } + // Pod exists, delete it + log.Printf("Pod %s is still active, cleaning up pod", podName) - // Then, explicitly delete all pods for this job (by job-name label) - podSelector := fmt.Sprintf("job-name=%s", jobName) - log.Printf("Deleting pods with job-name selector: %s", podSelector) - err = config.K8sClient.CoreV1().Pods(sessionNamespace).DeleteCollection(context.TODO(), v1.DeleteOptions{}, v1.ListOptions{ - LabelSelector: podSelector, - }) - if err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete pods for job %s: %v (continuing anyway)", jobName, err) - } else { - log.Printf("Successfully deleted pods for job %s", jobName) - } + // Delete the pod + deletePolicy := v1.DeletePropagationForeground + err = config.K8sClient.CoreV1().Pods(sessionNamespace).Delete(context.TODO(), podName, v1.DeleteOptions{ + PropagationPolicy: &deletePolicy, + }) + if err != nil && !errors.IsNotFound(err) { + log.Printf("Failed to delete pod %s: %v", podName, err) + } else { + log.Printf("Successfully deleted pod %s for stopped session", podName) + } - // Also delete any pods labeled with this session (in case owner refs are lost) - sessionPodSelector := fmt.Sprintf("agentic-session=%s", name) - log.Printf("Deleting pods with agentic-session selector: %s", sessionPodSelector) - err = config.K8sClient.CoreV1().Pods(sessionNamespace).DeleteCollection(context.TODO(), v1.DeleteOptions{}, v1.ListOptions{ - LabelSelector: sessionPodSelector, - }) - if err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete session-labeled pods: %v (continuing anyway)", err) - } else { - log.Printf("Successfully deleted session-labeled pods") - } + // Also delete any other pods labeled with this session (in case owner refs are lost) + sessionPodSelector := fmt.Sprintf("agentic-session=%s", name) + log.Printf("Deleting pods with agentic-session selector: %s", sessionPodSelector) + err = config.K8sClient.CoreV1().Pods(sessionNamespace).DeleteCollection(context.TODO(), v1.DeleteOptions{}, v1.ListOptions{ + LabelSelector: sessionPodSelector, + }) + if err != nil && !errors.IsNotFound(err) { + log.Printf("Failed to delete session-labeled pods: %v (continuing anyway)", err) } else { - log.Printf("Job %s already completed (Succeeded: %d, Failed: %d), no cleanup needed", jobName, job.Status.Succeeded, job.Status.Failed) + log.Printf("Successfully deleted session-labeled pods") } } else if !errors.IsNotFound(err) { - log.Printf("Error checking job %s: %v", jobName, 
err) + log.Printf("Error checking pod %s: %v", podName, err) } else { - log.Printf("Job %s not found, already cleaned up", jobName) + log.Printf("Pod %s not found, already cleaned up", podName) } // Also cleanup ambient-vertex secret when session is stopped @@ -508,25 +422,25 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { // If in Creating phase, check if job exists if phase == "Creating" { - jobName := fmt.Sprintf("%s-job", name) - _, err := config.K8sClient.BatchV1().Jobs(sessionNamespace).Get(context.TODO(), jobName, v1.GetOptions{}) + podName := fmt.Sprintf("%s-runner", name) + _, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), podName, v1.GetOptions{}) if err == nil { - // Job exists, start monitoring if not already running - monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, jobName) - monitoredJobsMu.Lock() - alreadyMonitoring := monitoredJobs[monitorKey] + // Pod exists, start monitoring if not already running + monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, podName) + monitoredPodsMu.Lock() + alreadyMonitoring := monitoredPods[monitorKey] if !alreadyMonitoring { - monitoredJobs[monitorKey] = true - monitoredJobsMu.Unlock() - log.Printf("Resuming monitoring for existing job %s (session in Creating phase)", jobName) - go monitorJob(jobName, name, sessionNamespace) + monitoredPods[monitorKey] = true + monitoredPodsMu.Unlock() + log.Printf("Resuming monitoring for existing pod %s (session in Creating phase)", podName) + go monitorPod(podName, name, sessionNamespace) } else { - monitoredJobsMu.Unlock() - log.Printf("Job %s already being monitored, skipping duplicate", jobName) + monitoredPodsMu.Unlock() + log.Printf("Pod %s already being monitored, skipping duplicate", podName) } return nil } else if errors.IsNotFound(err) { - // Job doesn't exist but phase is Creating - check if this is due to a stop request + // Pod doesn't exist but phase is Creating - check if this is due to a stop request if desiredPhase == "Stopped" { // Job already gone, can transition directly to Stopped (skip Stopping phase) log.Printf("Session %s in Creating phase but job not found and stop requested, transitioning to Stopped", name) @@ -537,14 +451,14 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { Type: conditionReady, Status: "False", Reason: "UserStopped", - Message: "User requested stop during job creation", + Message: "User requested stop during pod creation", }) // Update progress-tracking conditions statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "False", Reason: "UserStopped", - Message: "Job deleted by user stop request", + Message: "Pod deleted by user stop request", }) statusPatch.AddCondition(conditionUpdate{ Type: conditionRunnerStarted, @@ -558,11 +472,11 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { return nil } - // Job doesn't exist but phase is Creating - this is inconsistent state + // Pod doesn't exist but phase is Creating - this is inconsistent state // Could happen if: - // 1. Job was manually deleted - // 2. Operator crashed between job creation and status update - // 3. Session is being stopped and job was deleted (stale event) + // 1. Pod was manually deleted + // 2. Operator crashed between pod creation and status update + // 3. 
Session is being stopped and pod was deleted (stale event) // Before recreating, verify the session hasn't been stopped // Fetch fresh status to check for recent state changes @@ -579,26 +493,26 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { freshStatus, _, _ := unstructured.NestedMap(freshObj.Object, "status") freshPhase, _, _ := unstructured.NestedString(freshStatus, "phase") if freshPhase == "Stopped" || freshPhase == "Stopping" || freshPhase == "Failed" || freshPhase == "Completed" { - log.Printf("Session %s is now in %s phase (stale Creating event), skipping job recreation", name, freshPhase) + log.Printf("Session %s is now in %s phase (stale Creating event), skipping pod recreation", name, freshPhase) return nil } } - log.Printf("Session %s in Creating phase but job not found, resetting to Pending and recreating", name) + log.Printf("Session %s in Creating phase but pod not found, resetting to Pending and recreating", name) statusPatch.SetField("phase", "Pending") statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "False", - Reason: "JobMissing", - Message: "Job not found, will recreate", + Reason: "PodMissing", + Message: "Pod not found, will recreate", }) // Apply immediately and continue to Pending logic _ = statusPatch.ApplyAndReset() - // Don't return - fall through to Pending logic to create job + // Don't return - fall through to Pending logic to create pod _ = "Pending" // phase reset handled by status update } else { - // Error checking job - log and continue - log.Printf("Error checking job for Creating session %s: %v, will attempt recovery", name, err) + // Error checking pod - log and continue + log.Printf("Error checking pod for Creating session %s: %v, will attempt recovery", name, err) // Fall through to Pending logic _ = "Pending" // phase reset handled by status update } @@ -620,90 +534,8 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { } } - // Determine PVC name and owner references - var pvcName string - var ownerRefs []v1.OwnerReference - reusingPVC := false - - if parentSessionID != "" { - // Continuation: reuse parent's PVC - pvcName = fmt.Sprintf("ambient-workspace-%s", parentSessionID) - reusingPVC = true - log.Printf("Session continuation: reusing PVC %s from parent session %s", pvcName, parentSessionID) - // No owner refs - we don't own the parent's PVC - } else { - // New session: create fresh PVC with owner refs - pvcName = fmt.Sprintf("ambient-workspace-%s", name) - ownerRefs = []v1.OwnerReference{ - { - APIVersion: "vteam.ambient-code/v1", - Kind: "AgenticSession", - Name: currentObj.GetName(), - UID: currentObj.GetUID(), - Controller: boolPtr(true), - // BlockOwnerDeletion intentionally omitted to avoid permission issues - }, - } - } - - // Ensure PVC exists (skip for continuation if parent's PVC should exist) - if !reusingPVC { - if err := services.EnsureSessionWorkspacePVC(sessionNamespace, pvcName, ownerRefs); err != nil { - log.Printf("Failed to ensure session PVC %s in %s: %v", pvcName, sessionNamespace, err) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "False", - Reason: "ProvisioningFailed", - Message: err.Error(), - }) - } else { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "True", - Reason: "Bound", - Message: fmt.Sprintf("PVC %s ready", pvcName), - }) - } - } else { - // Verify parent's PVC exists - if _, err := 
config.K8sClient.CoreV1().PersistentVolumeClaims(sessionNamespace).Get(context.TODO(), pvcName, v1.GetOptions{}); err != nil { - log.Printf("Warning: Parent PVC %s not found for continuation session %s: %v", pvcName, name, err) - // Fall back to creating new PVC with current session's owner refs - pvcName = fmt.Sprintf("ambient-workspace-%s", name) - ownerRefs = []v1.OwnerReference{ - { - APIVersion: "vteam.ambient-code/v1", - Kind: "AgenticSession", - Name: currentObj.GetName(), - UID: currentObj.GetUID(), - Controller: boolPtr(true), - }, - } - if err := services.EnsureSessionWorkspacePVC(sessionNamespace, pvcName, ownerRefs); err != nil { - log.Printf("Failed to create fallback PVC %s: %v", pvcName, err) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "False", - Reason: "ProvisioningFailed", - Message: err.Error(), - }) - } else { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "True", - Reason: "Bound", - Message: fmt.Sprintf("PVC %s ready", pvcName), - }) - } - } else { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "True", - Reason: "Reused", - Message: fmt.Sprintf("Reused PVC %s from parent session", pvcName), - }) - } - } + // EmptyDir replaces PVC - session state persists in S3 + log.Printf("Session will use EmptyDir with S3 state persistence") // Load config for this session appConfig := config.LoadConfig() @@ -795,61 +627,49 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { log.Printf("Langfuse disabled, skipping secret copy") } - // CRITICAL: Delete temp content pod before creating Job to avoid PVC mount conflict - // The PVC is ReadWriteOnce, so only one pod can mount it at a time - tempPodName = fmt.Sprintf("temp-content-%s", name) - if _, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), tempPodName, v1.GetOptions{}); err == nil { - log.Printf("[PVCConflict] Deleting temp pod %s before creating Job (ReadWriteOnce PVC)", tempPodName) - - // Force immediate termination with zero grace period - gracePeriod := int64(0) - deleteOptions := v1.DeleteOptions{ - GracePeriodSeconds: &gracePeriod, - } - if err := config.K8sClient.CoreV1().Pods(sessionNamespace).Delete(context.TODO(), tempPodName, deleteOptions); err != nil && !errors.IsNotFound(err) { - log.Printf("[PVCConflict] Warning: failed to delete temp pod: %v", err) - } + // Create a Kubernetes Pod for this AgenticSession + podName := fmt.Sprintf("%s-runner", name) - // Wait for temp pod to fully terminate to prevent PVC mount conflicts - // This is critical because ReadWriteOnce PVCs cannot be mounted by multiple pods - // With gracePeriod=0, this should complete in 1-3 seconds - log.Printf("[PVCConflict] Waiting for temp pod %s to fully terminate...", tempPodName) - maxWaitSeconds := 10 // Reduced from 30 since we're force-deleting - for i := 0; i < maxWaitSeconds*4; i++ { // Poll 4x per second for faster detection - _, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), tempPodName, v1.GetOptions{}) - if errors.IsNotFound(err) { - elapsed := float64(i) * 0.25 - log.Printf("[PVCConflict] Temp pod fully terminated after %.2f seconds", elapsed) - break - } - if i == (maxWaitSeconds*4)-1 { - log.Printf("[PVCConflict] Warning: temp pod still exists after %d seconds, proceeding anyway", maxWaitSeconds) + // Ensure runner token exists before creating pod + // This handles cases where sessions are created directly via kubectl (bypassing the backend) + // or when 
the backend failed to provision the token + runnerTokenSecretName := fmt.Sprintf("ambient-runner-token-%s", name) + if _, err := config.K8sClient.CoreV1().Secrets(sessionNamespace).Get(context.TODO(), runnerTokenSecretName, v1.GetOptions{}); err != nil { + if errors.IsNotFound(err) { + log.Printf("Runner token secret %s not found, creating it now", runnerTokenSecretName) + if err := regenerateRunnerToken(sessionNamespace, name, currentObj); err != nil { + errMsg := fmt.Sprintf("Failed to provision runner token: %v", err) + log.Print(errMsg) + statusPatch.SetField("phase", "Failed") + statusPatch.AddCondition(conditionUpdate{ + Type: conditionReady, + Status: "False", + Reason: "TokenProvisionFailed", + Message: errMsg, + }) + _ = statusPatch.Apply() + return fmt.Errorf("failed to provision runner token for session %s: %v", name, err) } - time.Sleep(250 * time.Millisecond) // Poll every 250ms instead of 1s + log.Printf("Successfully provisioned runner token for session %s", name) + } else { + log.Printf("Warning: error checking runner token secret: %v", err) } - - // Clear temp pod annotations since we're starting the session - _ = clearAnnotation(sessionNamespace, name, tempContentRequestedAnnotation) - _ = clearAnnotation(sessionNamespace, name, tempContentLastAccessedAnnotation) } - // Create a Kubernetes Job for this AgenticSession - jobName := fmt.Sprintf("%s-job", name) - - // Check if job already exists in the session's namespace - _, err = config.K8sClient.BatchV1().Jobs(sessionNamespace).Get(context.TODO(), jobName, v1.GetOptions{}) + // Check if pod already exists in the session's namespace + _, err = config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), podName, v1.GetOptions{}) if err == nil { - log.Printf("Job %s already exists for AgenticSession %s", jobName, name) + log.Printf("Pod %s already exists for AgenticSession %s", podName, name) statusPatch.SetField("phase", "Creating") statusPatch.SetField("observedGeneration", currentObj.GetGeneration()) statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "True", - Reason: "JobExists", - Message: "Runner job already exists", + Reason: "PodExists", + Message: "Runner pod already exists", }) _ = statusPatch.Apply() - // Clear desired-phase annotation if it exists (job already created) + // Clear desired-phase annotation if it exists (pod already created) _ = clearAnnotation(sessionNamespace, name, "ambient-code.io/desired-phase") return nil } @@ -927,7 +747,7 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { var repos []RepoConfig - // Read simplified repos[] array format + // Read repos[] array format if reposArr, found, _ := unstructured.NestedSlice(spec, "repos"); found && len(reposArr) > 0 { repos = make([]RepoConfig, 0, len(reposArr)) for _, repoItem := range reposArr { @@ -946,34 +766,6 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { } } } - } else { - // Fallback to old format for backward compatibility (input/output structure) - inputRepo, _, _ := unstructured.NestedString(spec, "inputRepo") - inputBranch, _, _ := unstructured.NestedString(spec, "inputBranch") - if v, found, _ := unstructured.NestedString(spec, "input", "repo"); found && strings.TrimSpace(v) != "" { - inputRepo = v - } - if v, found, _ := unstructured.NestedString(spec, "input", "branch"); found && strings.TrimSpace(v) != "" { - inputBranch = v - } - if inputRepo != "" { - if inputBranch == "" { - inputBranch = "main" - } - repos = 
[]RepoConfig{{ - URL: inputRepo, - Branch: inputBranch, - }} - } - } - - // Get first repo for backward compatibility env vars (first repo is always main repo) - var inputRepo, inputBranch, outputRepo, outputBranch string - if len(repos) > 0 { - inputRepo = repos[0].URL - inputBranch = repos[0].Branch - outputRepo = repos[0].URL // Output same as input in simplified format - outputBranch = repos[0].Branch } // Read autoPushOnComplete flag @@ -992,18 +784,45 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { } log.Printf("Session %s initiated by user: %s (userId: %s)", name, userName, userID) - // Create the Job - job := &batchv1.Job{ + // Get S3 configuration for this project (from project secret or operator defaults) + s3Endpoint, s3Bucket, s3AccessKey, s3SecretKey, err := getS3ConfigForProject(sessionNamespace, appConfig) + if err != nil { + log.Printf("Warning: S3 not available for project %s: %v (sessions will use ephemeral storage only)", sessionNamespace, err) + statusPatch.AddCondition(conditionUpdate{ + Type: "S3Available", + Status: "False", + Reason: "NotConfigured", + Message: fmt.Sprintf("S3 storage not configured: %v. Session state will not persist across pod restarts. Configure S3 in project settings.", err), + }) + // Set empty values - init-hydrate and state-sync will skip S3 operations + s3Endpoint = "" + s3Bucket = "" + s3AccessKey = "" + s3SecretKey = "" + } else { + log.Printf("S3 configured for project %s: endpoint=%s, bucket=%s", sessionNamespace, s3Endpoint, s3Bucket) + statusPatch.AddCondition(conditionUpdate{ + Type: "S3Available", + Status: "True", + Reason: "Configured", + Message: fmt.Sprintf("S3 storage configured: %s/%s", s3Endpoint, s3Bucket), + }) + } + + // Create the Pod directly (no Job wrapper for faster startup) + pod := &corev1.Pod{ ObjectMeta: v1.ObjectMeta{ - Name: jobName, + Name: podName, Namespace: sessionNamespace, Labels: map[string]string{ "agentic-session": name, "app": "ambient-code-runner", }, + // If you run a service mesh that injects sidecars and causes egress issues: + // Annotations: map[string]string{"sidecar.istio.io/inject": "false"}, OwnerReferences: []v1.OwnerReference{ { - APIVersion: "vteam.ambient-code/v1", + APIVersion: "vteam.ambient-code/v1alpha1", Kind: "AgenticSession", Name: currentObj.GetName(), UID: currentObj.GetUID(), @@ -1013,339 +832,418 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }, }, }, - Spec: batchv1.JobSpec{ - BackoffLimit: int32Ptr(3), - ActiveDeadlineSeconds: int64Ptr(14400), // 4 hour timeout for safety - // Auto-cleanup finished Jobs if TTL controller is enabled in the cluster - TTLSecondsAfterFinished: int32Ptr(600), - Template: corev1.PodTemplateSpec{ - ObjectMeta: v1.ObjectMeta{ - Labels: map[string]string{ - "agentic-session": name, - "app": "ambient-code-runner", + Spec: corev1.PodSpec{ + RestartPolicy: corev1.RestartPolicyNever, + TerminationGracePeriodSeconds: int64Ptr(30), // Allow time for state-sync final sync + // Explicitly set service account for pod creation permissions + AutomountServiceAccountToken: boolPtr(false), + Volumes: []corev1.Volume{ + { + Name: "workspace", + VolumeSource: corev1.VolumeSource{ + EmptyDir: &corev1.EmptyDirVolumeSource{ + SizeLimit: resource.NewQuantity(10*1024*1024*1024, resource.BinarySI), // 10Gi + }, }, - // If you run a service mesh that injects sidecars and causes egress issues for Jobs: - // Annotations: map[string]string{"sidecar.istio.io/inject": "false"}, }, - Spec: corev1.PodSpec{ - RestartPolicy: 
corev1.RestartPolicyNever, - // Explicitly set service account for pod creation permissions - AutomountServiceAccountToken: boolPtr(false), - Volumes: []corev1.Volume{ - { - Name: "workspace", - VolumeSource: corev1.VolumeSource{ - PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ - ClaimName: pvcName, - }, - }, + }, + + // InitContainer to hydrate session state from S3 + InitContainers: []corev1.Container{ + { + Name: "init-hydrate", + Image: appConfig.StateSyncImage, + ImagePullPolicy: appConfig.ImagePullPolicy, + Command: []string{"/usr/local/bin/hydrate.sh"}, + SecurityContext: &corev1.SecurityContext{ + AllowPrivilegeEscalation: boolPtr(false), + ReadOnlyRootFilesystem: boolPtr(false), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, }, }, + Env: func() []corev1.EnvVar { + base := []corev1.EnvVar{ + {Name: "SESSION_NAME", Value: name}, + {Name: "NAMESPACE", Value: sessionNamespace}, + {Name: "S3_ENDPOINT", Value: s3Endpoint}, + {Name: "S3_BUCKET", Value: s3Bucket}, + {Name: "AWS_ACCESS_KEY_ID", Value: s3AccessKey}, + {Name: "AWS_SECRET_ACCESS_KEY", Value: s3SecretKey}, + {Name: "GIT_USER_NAME", Value: os.Getenv("GIT_USER_NAME")}, + {Name: "GIT_USER_EMAIL", Value: os.Getenv("GIT_USER_EMAIL")}, + } + + // Add repos JSON if present + if repos, ok := spec["repos"].([]interface{}); ok && len(repos) > 0 { + b, _ := json.Marshal(repos) + base = append(base, corev1.EnvVar{Name: "REPOS_JSON", Value: string(b)}) + } + + // Add workflow info if present + if workflow, ok := spec["activeWorkflow"].(map[string]interface{}); ok { + if gitURL, ok := workflow["gitUrl"].(string); ok && strings.TrimSpace(gitURL) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_GIT_URL", Value: gitURL}) + } + if branch, ok := workflow["branch"].(string); ok && strings.TrimSpace(branch) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_BRANCH", Value: branch}) + } + if path, ok := workflow["path"].(string); ok && strings.TrimSpace(path) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_PATH", Value: path}) + } + } + + // Add GitHub token for private repos + secretName := "" + if meta, ok := currentObj.Object["metadata"].(map[string]interface{}); ok { + if anns, ok := meta["annotations"].(map[string]interface{}); ok { + if v, ok := anns["ambient-code.io/runner-token-secret"].(string); ok && strings.TrimSpace(v) != "" { + secretName = strings.TrimSpace(v) + } + } + } + if secretName == "" { + secretName = fmt.Sprintf("ambient-runner-token-%s", name) + } + base = append(base, corev1.EnvVar{ + Name: "BOT_TOKEN", + ValueFrom: &corev1.EnvVarSource{SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: secretName}, + Key: "k8s-token", + }}, + }) - // InitContainer to ensure workspace directory structure exists - InitContainers: []corev1.Container{ - { - Name: "init-workspace", - Image: "registry.access.redhat.com/ubi8/ubi-minimal:latest", - Command: []string{ - "sh", "-c", - fmt.Sprintf("mkdir -p /workspace/sessions/%s/workspace && chmod 777 /workspace/sessions/%s/workspace && echo 'Workspace initialized'", name, name), - }, - VolumeMounts: []corev1.VolumeMount{ - {Name: "workspace", MountPath: "/workspace"}, - }, - }, + return base + }(), + VolumeMounts: []corev1.VolumeMount{ + {Name: "workspace", MountPath: "/workspace"}, }, + }, + }, - // Flip roles so the content writer is the main container that keeps the pod alive - Containers: []corev1.Container{ - { - Name: "ambient-content", - Image: 
appConfig.ContentServiceImage, - ImagePullPolicy: appConfig.ImagePullPolicy, - Env: []corev1.EnvVar{ - {Name: "CONTENT_SERVICE_MODE", Value: "true"}, - {Name: "STATE_BASE_DIR", Value: "/workspace"}, - }, - Ports: []corev1.ContainerPort{{ContainerPort: 8080, Name: "http"}}, - ReadinessProbe: &corev1.Probe{ - ProbeHandler: corev1.ProbeHandler{ - HTTPGet: &corev1.HTTPGetAction{ - Path: "/health", - Port: intstr.FromString("http"), - }, - }, - InitialDelaySeconds: 5, - PeriodSeconds: 5, + // Flip roles so the content writer is the main container that keeps the pod alive + Containers: []corev1.Container{ + { + Name: "ambient-content", + Image: appConfig.ContentServiceImage, + ImagePullPolicy: appConfig.ImagePullPolicy, + Env: []corev1.EnvVar{ + {Name: "CONTENT_SERVICE_MODE", Value: "true"}, + {Name: "STATE_BASE_DIR", Value: "/workspace"}, + }, + Ports: []corev1.ContainerPort{{ContainerPort: 8080, Name: "http"}}, + ReadinessProbe: &corev1.Probe{ + ProbeHandler: corev1.ProbeHandler{ + HTTPGet: &corev1.HTTPGetAction{ + Path: "/health", + Port: intstr.FromString("http"), }, - VolumeMounts: []corev1.VolumeMount{{Name: "workspace", MountPath: "/workspace"}}, }, - { - Name: "ambient-code-runner", - Image: appConfig.AmbientCodeRunnerImage, - ImagePullPolicy: appConfig.ImagePullPolicy, - // 🔒 Container-level security (SCC-compatible, no privileged capabilities) - SecurityContext: &corev1.SecurityContext{ - AllowPrivilegeEscalation: boolPtr(false), - ReadOnlyRootFilesystem: boolPtr(false), // Playwright needs to write temp files - Capabilities: &corev1.Capabilities{ - Drop: []corev1.Capability{"ALL"}, // Drop all capabilities for security - }, - }, - - // Expose AG-UI server port for backend proxy - Ports: []corev1.ContainerPort{{ - Name: "agui", - ContainerPort: 8001, - Protocol: corev1.ProtocolTCP, - }}, - - VolumeMounts: []corev1.VolumeMount{ - {Name: "workspace", MountPath: "/workspace", ReadOnly: false}, - // Mount .claude directory for session state persistence - // This enables SDK's built-in resume functionality - {Name: "workspace", MountPath: "/app/.claude", SubPath: fmt.Sprintf("sessions/%s/.claude", name), ReadOnly: false}, - }, - - Env: func() []corev1.EnvVar { - base := []corev1.EnvVar{ - {Name: "DEBUG", Value: "true"}, - {Name: "INTERACTIVE", Value: fmt.Sprintf("%t", interactive)}, - {Name: "AGENTIC_SESSION_NAME", Value: name}, - {Name: "AGENTIC_SESSION_NAMESPACE", Value: sessionNamespace}, - // Provide session id and workspace path for the runner wrapper - {Name: "SESSION_ID", Value: name}, - {Name: "WORKSPACE_PATH", Value: fmt.Sprintf("/workspace/sessions/%s/workspace", name)}, - {Name: "ARTIFACTS_DIR", Value: "_artifacts"}, - // Google MCP credentials directory for workspace-mcp server (writable workspace location) - {Name: "GOOGLE_MCP_CREDENTIALS_DIR", Value: "/workspace/.google_workspace_mcp/credentials"}, - // Google OAuth client credentials for workspace-mcp - {Name: "GOOGLE_OAUTH_CLIENT_ID", Value: os.Getenv("GOOGLE_OAUTH_CLIENT_ID")}, - {Name: "GOOGLE_OAUTH_CLIENT_SECRET", Value: os.Getenv("GOOGLE_OAUTH_CLIENT_SECRET")}, - } + InitialDelaySeconds: 5, + PeriodSeconds: 5, + }, + VolumeMounts: []corev1.VolumeMount{{Name: "workspace", MountPath: "/workspace"}}, + }, + { + Name: "ambient-code-runner", + Image: appConfig.AmbientCodeRunnerImage, + ImagePullPolicy: appConfig.ImagePullPolicy, + // 🔒 Container-level security (SCC-compatible, no privileged capabilities) + SecurityContext: &corev1.SecurityContext{ + AllowPrivilegeEscalation: boolPtr(false), + ReadOnlyRootFilesystem: 
boolPtr(false), // Playwright needs to write temp files + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, // Drop all capabilities for security + }, + }, - // Add user context for observability and auditing (Langfuse userId, logs, etc.) - if userID != "" { - base = append(base, corev1.EnvVar{Name: "USER_ID", Value: userID}) - } - if userName != "" { - base = append(base, corev1.EnvVar{Name: "USER_NAME", Value: userName}) - } + // Expose AG-UI server port for backend proxy + Ports: []corev1.ContainerPort{{ + Name: "agui", + ContainerPort: 8001, + Protocol: corev1.ProtocolTCP, + }}, - // Add per-repo environment variables (simplified format) - for i, repo := range repos { - base = append(base, - corev1.EnvVar{Name: fmt.Sprintf("REPO_%d_URL", i), Value: repo.URL}, - corev1.EnvVar{Name: fmt.Sprintf("REPO_%d_BRANCH", i), Value: repo.Branch}, - ) - } + VolumeMounts: []corev1.VolumeMount{ + {Name: "workspace", MountPath: "/workspace", ReadOnly: false}, + // Mount .claude directory for session state persistence (synced to S3) + // This enables SDK's built-in resume functionality + {Name: "workspace", MountPath: "/app/.claude", SubPath: ".claude", ReadOnly: false}, + }, - // Backward compatibility: set INPUT_REPO_URL/OUTPUT_REPO_URL from main repo - base = append(base, - corev1.EnvVar{Name: "INPUT_REPO_URL", Value: inputRepo}, - corev1.EnvVar{Name: "INPUT_BRANCH", Value: inputBranch}, - corev1.EnvVar{Name: "OUTPUT_REPO_URL", Value: outputRepo}, - corev1.EnvVar{Name: "OUTPUT_BRANCH", Value: outputBranch}, - corev1.EnvVar{Name: "INITIAL_PROMPT", Value: prompt}, - corev1.EnvVar{Name: "LLM_MODEL", Value: model}, - corev1.EnvVar{Name: "LLM_TEMPERATURE", Value: fmt.Sprintf("%.2f", temperature)}, - corev1.EnvVar{Name: "LLM_MAX_TOKENS", Value: fmt.Sprintf("%d", maxTokens)}, - corev1.EnvVar{Name: "USE_AGUI", Value: "true"}, - corev1.EnvVar{Name: "TIMEOUT", Value: fmt.Sprintf("%d", timeout)}, - corev1.EnvVar{Name: "AUTO_PUSH_ON_COMPLETE", Value: fmt.Sprintf("%t", autoPushOnComplete)}, - corev1.EnvVar{Name: "BACKEND_API_URL", Value: fmt.Sprintf("http://backend-service.%s.svc.cluster.local:8080/api", appConfig.BackendNamespace)}, - // LEGACY: WEBSOCKET_URL removed - runner now uses AG-UI server pattern (FastAPI) - // Backend proxies to runner's HTTP endpoint instead of WebSocket - ) - - // Platform-wide Langfuse observability configuration - // Uses secretKeyRef to prevent credential exposure in pod specs - // Secret is copied to session namespace from operator namespace - // All keys are optional to prevent pod startup failures if keys are missing - if ambientLangfuseSecretCopied { - base = append(base, - corev1.EnvVar{ - Name: "LANGFUSE_ENABLED", - ValueFrom: &corev1.EnvVarSource{ - SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, - Key: "LANGFUSE_ENABLED", - Optional: boolPtr(true), - }, - }, + Env: func() []corev1.EnvVar { + base := []corev1.EnvVar{ + {Name: "DEBUG", Value: "true"}, + {Name: "INTERACTIVE", Value: fmt.Sprintf("%t", interactive)}, + {Name: "AGENTIC_SESSION_NAME", Value: name}, + {Name: "AGENTIC_SESSION_NAMESPACE", Value: sessionNamespace}, + // Provide session id and workspace path for the runner wrapper + {Name: "SESSION_ID", Value: name}, + {Name: "WORKSPACE_PATH", Value: "/workspace"}, + {Name: "ARTIFACTS_DIR", Value: "artifacts"}, + // Google MCP credentials directory for workspace-mcp server (writable workspace location) + {Name: "GOOGLE_MCP_CREDENTIALS_DIR", Value: 
"/workspace/.google_workspace_mcp/credentials"}, + // Google OAuth client credentials for workspace-mcp + {Name: "GOOGLE_OAUTH_CLIENT_ID", Value: os.Getenv("GOOGLE_OAUTH_CLIENT_ID")}, + {Name: "GOOGLE_OAUTH_CLIENT_SECRET", Value: os.Getenv("GOOGLE_OAUTH_CLIENT_SECRET")}, + } + + // Add user context for observability and auditing (Langfuse userId, logs, etc.) + if userID != "" { + base = append(base, corev1.EnvVar{Name: "USER_ID", Value: userID}) + } + if userName != "" { + base = append(base, corev1.EnvVar{Name: "USER_NAME", Value: userName}) + } + + // Core session env vars + base = append(base, + corev1.EnvVar{Name: "INITIAL_PROMPT", Value: prompt}, + corev1.EnvVar{Name: "LLM_MODEL", Value: model}, + corev1.EnvVar{Name: "LLM_TEMPERATURE", Value: fmt.Sprintf("%.2f", temperature)}, + corev1.EnvVar{Name: "LLM_MAX_TOKENS", Value: fmt.Sprintf("%d", maxTokens)}, + corev1.EnvVar{Name: "USE_AGUI", Value: "true"}, + corev1.EnvVar{Name: "TIMEOUT", Value: fmt.Sprintf("%d", timeout)}, + corev1.EnvVar{Name: "AUTO_PUSH_ON_COMPLETE", Value: fmt.Sprintf("%t", autoPushOnComplete)}, + corev1.EnvVar{Name: "BACKEND_API_URL", Value: fmt.Sprintf("http://backend-service.%s.svc.cluster.local:8080/api", appConfig.BackendNamespace)}, + // LEGACY: WEBSOCKET_URL removed - runner now uses AG-UI server pattern (FastAPI) + // Backend proxies to runner's HTTP endpoint instead of WebSocket + ) + + // Platform-wide Langfuse observability configuration + // Uses secretKeyRef to prevent credential exposure in pod specs + // Secret is copied to session namespace from operator namespace + // All keys are optional to prevent pod startup failures if keys are missing + if ambientLangfuseSecretCopied { + base = append(base, + corev1.EnvVar{ + Name: "LANGFUSE_ENABLED", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, + Key: "LANGFUSE_ENABLED", + Optional: boolPtr(true), }, - corev1.EnvVar{ - Name: "LANGFUSE_HOST", - ValueFrom: &corev1.EnvVarSource{ - SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, - Key: "LANGFUSE_HOST", - Optional: boolPtr(true), - }, - }, + }, + }, + corev1.EnvVar{ + Name: "LANGFUSE_HOST", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, + Key: "LANGFUSE_HOST", + Optional: boolPtr(true), }, - corev1.EnvVar{ - Name: "LANGFUSE_PUBLIC_KEY", - ValueFrom: &corev1.EnvVarSource{ - SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, - Key: "LANGFUSE_PUBLIC_KEY", - Optional: boolPtr(true), - }, - }, + }, + }, + corev1.EnvVar{ + Name: "LANGFUSE_PUBLIC_KEY", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, + Key: "LANGFUSE_PUBLIC_KEY", + Optional: boolPtr(true), }, - corev1.EnvVar{ - Name: "LANGFUSE_SECRET_KEY", - ValueFrom: &corev1.EnvVarSource{ - SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, - Key: "LANGFUSE_SECRET_KEY", - Optional: boolPtr(true), - }, - }, + }, + }, + corev1.EnvVar{ + Name: "LANGFUSE_SECRET_KEY", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: 
corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, + Key: "LANGFUSE_SECRET_KEY", + Optional: boolPtr(true), }, - ) - log.Printf("Langfuse env vars configured via secretKeyRef for session %s", name) + }, + }, + ) + log.Printf("Langfuse env vars configured via secretKeyRef for session %s", name) + } + + // Add Vertex AI configuration only if enabled + if vertexEnabled { + base = append(base, + corev1.EnvVar{Name: "CLAUDE_CODE_USE_VERTEX", Value: "1"}, + corev1.EnvVar{Name: "CLOUD_ML_REGION", Value: os.Getenv("CLOUD_ML_REGION")}, + corev1.EnvVar{Name: "ANTHROPIC_VERTEX_PROJECT_ID", Value: os.Getenv("ANTHROPIC_VERTEX_PROJECT_ID")}, + corev1.EnvVar{Name: "GOOGLE_APPLICATION_CREDENTIALS", Value: os.Getenv("GOOGLE_APPLICATION_CREDENTIALS")}, + ) + } else { + // Explicitly set to 0 when Vertex is disabled + base = append(base, corev1.EnvVar{Name: "CLAUDE_CODE_USE_VERTEX", Value: "0"}) + } + + // Add PARENT_SESSION_ID if this is a continuation + if parentSessionID != "" { + base = append(base, corev1.EnvVar{Name: "PARENT_SESSION_ID", Value: parentSessionID}) + log.Printf("Session %s: passing PARENT_SESSION_ID=%s to runner", name, parentSessionID) + } + + // Add IS_RESUME if this session has been started before + // Check status.startTime - if present, this is a resume (pod recreate/restart) + // This tells the runner to skip INITIAL_PROMPT and use continue_conversation + if status, found, _ := unstructured.NestedMap(currentObj.Object, "status"); found { + if startTime, ok := status["startTime"].(string); ok && startTime != "" { + base = append(base, corev1.EnvVar{Name: "IS_RESUME", Value: "true"}) + log.Printf("Session %s: marking as resume (IS_RESUME=true, startTime=%s)", name, startTime) + } + } + + // If backend annotated the session with a runner token secret, inject only BOT_TOKEN + // Secret contains: 'k8s-token' (for CR updates) + // Prefer annotated secret name; fallback to deterministic name + secretName := "" + if meta, ok := currentObj.Object["metadata"].(map[string]interface{}); ok { + if anns, ok := meta["annotations"].(map[string]interface{}); ok { + if v, ok := anns["ambient-code.io/runner-token-secret"].(string); ok && strings.TrimSpace(v) != "" { + secretName = strings.TrimSpace(v) } - - // Add Vertex AI configuration only if enabled - if vertexEnabled { - base = append(base, - corev1.EnvVar{Name: "CLAUDE_CODE_USE_VERTEX", Value: "1"}, - corev1.EnvVar{Name: "CLOUD_ML_REGION", Value: os.Getenv("CLOUD_ML_REGION")}, - corev1.EnvVar{Name: "ANTHROPIC_VERTEX_PROJECT_ID", Value: os.Getenv("ANTHROPIC_VERTEX_PROJECT_ID")}, - corev1.EnvVar{Name: "GOOGLE_APPLICATION_CREDENTIALS", Value: os.Getenv("GOOGLE_APPLICATION_CREDENTIALS")}, - ) - } else { - // Explicitly set to 0 when Vertex is disabled - base = append(base, corev1.EnvVar{Name: "CLAUDE_CODE_USE_VERTEX", Value: "0"}) + } + } + if secretName == "" { + secretName = fmt.Sprintf("ambient-runner-token-%s", name) + } + base = append(base, corev1.EnvVar{ + Name: "BOT_TOKEN", + ValueFrom: &corev1.EnvVarSource{SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: secretName}, + Key: "k8s-token", + }}, + }) + // Add CR-provided envs last (override base when same key) + if spec, ok := currentObj.Object["spec"].(map[string]interface{}); ok { + // Inject REPOS_JSON and MAIN_REPO_NAME from spec.repos and spec.mainRepoName if present + if repos, ok := spec["repos"].([]interface{}); ok && len(repos) > 0 { + // Use a minimal JSON serialization via fmt (we'll rely on client to pass REPOS_JSON 
too) + // This ensures runner gets repos even if env vars weren't passed from frontend + b, _ := json.Marshal(repos) + base = append(base, corev1.EnvVar{Name: "REPOS_JSON", Value: string(b)}) + } + if mrn, ok := spec["mainRepoName"].(string); ok && strings.TrimSpace(mrn) != "" { + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_NAME", Value: mrn}) + } + // Inject MAIN_REPO_INDEX if provided + if mriRaw, ok := spec["mainRepoIndex"]; ok { + switch v := mriRaw.(type) { + case int64: + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) + case int32: + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) + case int: + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) + case float64: + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", int64(v))}) + case string: + if strings.TrimSpace(v) != "" { + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: v}) + } } - - // Add PARENT_SESSION_ID if this is a continuation - if parentSessionID != "" { - base = append(base, corev1.EnvVar{Name: "PARENT_SESSION_ID", Value: parentSessionID}) - log.Printf("Session %s: passing PARENT_SESSION_ID=%s to runner", name, parentSessionID) + } + // Inject activeWorkflow environment variables if present + if workflow, ok := spec["activeWorkflow"].(map[string]interface{}); ok { + if gitURL, ok := workflow["gitUrl"].(string); ok && strings.TrimSpace(gitURL) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_GIT_URL", Value: gitURL}) } - // If backend annotated the session with a runner token secret, inject only BOT_TOKEN - // Secret contains: 'k8s-token' (for CR updates) - // Prefer annotated secret name; fallback to deterministic name - secretName := "" - if meta, ok := currentObj.Object["metadata"].(map[string]interface{}); ok { - if anns, ok := meta["annotations"].(map[string]interface{}); ok { - if v, ok := anns["ambient-code.io/runner-token-secret"].(string); ok && strings.TrimSpace(v) != "" { - secretName = strings.TrimSpace(v) - } - } + if branch, ok := workflow["branch"].(string); ok && strings.TrimSpace(branch) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_BRANCH", Value: branch}) } - if secretName == "" { - secretName = fmt.Sprintf("ambient-runner-token-%s", name) + if path, ok := workflow["path"].(string); ok && strings.TrimSpace(path) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_PATH", Value: path}) } - base = append(base, corev1.EnvVar{ - Name: "BOT_TOKEN", - ValueFrom: &corev1.EnvVarSource{SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: secretName}, - Key: "k8s-token", - }}, - }) - // Add CR-provided envs last (override base when same key) - if spec, ok := currentObj.Object["spec"].(map[string]interface{}); ok { - // Inject REPOS_JSON and MAIN_REPO_NAME from spec.repos and spec.mainRepoName if present - if repos, ok := spec["repos"].([]interface{}); ok && len(repos) > 0 { - // Use a minimal JSON serialization via fmt (we'll rely on client to pass REPOS_JSON too) - // This ensures runner gets repos even if env vars weren't passed from frontend - b, _ := json.Marshal(repos) - base = append(base, corev1.EnvVar{Name: "REPOS_JSON", Value: string(b)}) - } - if mrn, ok := spec["mainRepoName"].(string); ok && strings.TrimSpace(mrn) != "" { - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_NAME", Value: mrn}) - } - // Inject MAIN_REPO_INDEX if 
provided - if mriRaw, ok := spec["mainRepoIndex"]; ok { - switch v := mriRaw.(type) { - case int64: - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) - case int32: - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) - case int: - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) - case float64: - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", int64(v))}) - case string: - if strings.TrimSpace(v) != "" { - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: v}) + } + if envMap, ok := spec["environmentVariables"].(map[string]interface{}); ok { + for k, v := range envMap { + if vs, ok := v.(string); ok { + // replace if exists + replaced := false + for i := range base { + if base[i].Name == k { + base[i].Value = vs + replaced = true + break } } - } - // Inject activeWorkflow environment variables if present - if workflow, ok := spec["activeWorkflow"].(map[string]interface{}); ok { - if gitURL, ok := workflow["gitUrl"].(string); ok && strings.TrimSpace(gitURL) != "" { - base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_GIT_URL", Value: gitURL}) - } - if branch, ok := workflow["branch"].(string); ok && strings.TrimSpace(branch) != "" { - base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_BRANCH", Value: branch}) - } - if path, ok := workflow["path"].(string); ok && strings.TrimSpace(path) != "" { - base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_PATH", Value: path}) + if !replaced { + base = append(base, corev1.EnvVar{Name: k, Value: vs}) } } - if envMap, ok := spec["environmentVariables"].(map[string]interface{}); ok { - for k, v := range envMap { - if vs, ok := v.(string); ok { - // replace if exists - replaced := false - for i := range base { - if base[i].Name == k { - base[i].Value = vs - replaced = true - break - } - } - if !replaced { - base = append(base, corev1.EnvVar{Name: k, Value: vs}) - } - } - } - } - } - - return base - }(), - - // Import secrets as environment variables - // - integrationSecretsName: Only if exists (GIT_TOKEN, JIRA_*, custom keys) - // - runnerSecretsName: Only when Vertex disabled (ANTHROPIC_API_KEY) - // - ambient-langfuse-keys: Platform-wide Langfuse observability (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, LANGFUSE_ENABLED) - EnvFrom: func() []corev1.EnvFromSource { - sources := []corev1.EnvFromSource{} - - // Only inject integration secrets if they exist (optional) - if integrationSecretsExist { - sources = append(sources, corev1.EnvFromSource{ - SecretRef: &corev1.SecretEnvSource{ - LocalObjectReference: corev1.LocalObjectReference{Name: integrationSecretsName}, - }, - }) - log.Printf("Injecting integration secrets from '%s' for session %s", integrationSecretsName, name) - } else { - log.Printf("Skipping integration secrets '%s' for session %s (not found or not configured)", integrationSecretsName, name) - } - - // Only inject runner secrets (ANTHROPIC_API_KEY) when Vertex is disabled - if !vertexEnabled && runnerSecretsName != "" { - sources = append(sources, corev1.EnvFromSource{ - SecretRef: &corev1.SecretEnvSource{ - LocalObjectReference: corev1.LocalObjectReference{Name: runnerSecretsName}, - }, - }) - log.Printf("Injecting runner secrets from '%s' for session %s (Vertex disabled)", runnerSecretsName, name) - } else if vertexEnabled && runnerSecretsName != "" { - log.Printf("Skipping runner secrets '%s' for session %s (Vertex enabled)", 
runnerSecretsName, name) } + } + } + + return base + }(), + + // Import secrets as environment variables + // - integrationSecretsName: Only if exists (GIT_TOKEN, JIRA_*, custom keys) + // - runnerSecretsName: Only when Vertex disabled (ANTHROPIC_API_KEY) + // - ambient-langfuse-keys: Platform-wide Langfuse observability (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, LANGFUSE_ENABLED) + EnvFrom: func() []corev1.EnvFromSource { + sources := []corev1.EnvFromSource{} + + // Only inject integration secrets if they exist (optional) + if integrationSecretsExist { + sources = append(sources, corev1.EnvFromSource{ + SecretRef: &corev1.SecretEnvSource{ + LocalObjectReference: corev1.LocalObjectReference{Name: integrationSecretsName}, + }, + }) + log.Printf("Injecting integration secrets from '%s' for session %s", integrationSecretsName, name) + } else { + log.Printf("Skipping integration secrets '%s' for session %s (not found or not configured)", integrationSecretsName, name) + } + + // Only inject runner secrets (ANTHROPIC_API_KEY) when Vertex is disabled + if !vertexEnabled && runnerSecretsName != "" { + sources = append(sources, corev1.EnvFromSource{ + SecretRef: &corev1.SecretEnvSource{ + LocalObjectReference: corev1.LocalObjectReference{Name: runnerSecretsName}, + }, + }) + log.Printf("Injecting runner secrets from '%s' for session %s (Vertex disabled)", runnerSecretsName, name) + } else if vertexEnabled && runnerSecretsName != "" { + log.Printf("Skipping runner secrets '%s' for session %s (Vertex enabled)", runnerSecretsName, name) + } - return sources - }(), + return sources + }(), - Resources: corev1.ResourceRequirements{}, + Resources: corev1.ResourceRequirements{}, + }, + // S3 state-sync sidecar - syncs .claude/, artifacts/, uploads/ to S3 + { + Name: "state-sync", + Image: appConfig.StateSyncImage, + ImagePullPolicy: appConfig.ImagePullPolicy, + Command: []string{"/usr/local/bin/sync.sh"}, + SecurityContext: &corev1.SecurityContext{ + AllowPrivilegeEscalation: boolPtr(false), + ReadOnlyRootFilesystem: boolPtr(false), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, + }, + }, + Env: []corev1.EnvVar{ + {Name: "SESSION_NAME", Value: name}, + {Name: "NAMESPACE", Value: sessionNamespace}, + {Name: "S3_ENDPOINT", Value: s3Endpoint}, + {Name: "S3_BUCKET", Value: s3Bucket}, + {Name: "SYNC_INTERVAL", Value: "60"}, + {Name: "MAX_SYNC_SIZE", Value: "1073741824"}, // 1GB + {Name: "AWS_ACCESS_KEY_ID", Value: s3AccessKey}, + {Name: "AWS_SECRET_ACCESS_KEY", Value: s3SecretKey}, + }, + VolumeMounts: []corev1.VolumeMount{ + {Name: "workspace", MountPath: "/workspace", ReadOnly: false}, + }, + Resources: corev1.ResourceRequirements{ + Requests: corev1.ResourceList{ + corev1.ResourceCPU: resource.MustParse("50m"), + corev1.ResourceMemory: resource.MustParse("64Mi"), + }, + Limits: corev1.ResourceList{ + corev1.ResourceCPU: resource.MustParse("200m"), + corev1.ResourceMemory: resource.MustParse("256Mi"), }, }, }, @@ -1358,14 +1256,14 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { // If ambient-vertex secret was successfully copied, mount it as a volume if ambientVertexSecretCopied { - job.Spec.Template.Spec.Volumes = append(job.Spec.Template.Spec.Volumes, corev1.Volume{ + pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{ Name: "vertex", VolumeSource: corev1.VolumeSource{Secret: &corev1.SecretVolumeSource{SecretName: types.AmbientVertexSecretName}}, }) // Mount to the ambient-code-runner container by name - for i := range 
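// A minimal sketch of the loop the state-sync sidecar above presumably runs, based only on
// the contract implied by its env vars (SESSION_NAME, S3_ENDPOINT, S3_BUCKET, SYNC_INTERVAL,
// MAX_SYNC_SIZE, AWS_* credentials) and its Command of /usr/local/bin/sync.sh. The actual
// script is shell and lives elsewhere in this patch; the AWS CLI invocation, the
// "sessions/<name>" key prefix, and the imports (fmt, log, os, os/exec, strconv, time) are
// assumptions for illustration only:
//
//	func runStateSync() {
//		intervalSec, _ := strconv.Atoi(os.Getenv("SYNC_INTERVAL"))
//		dst := fmt.Sprintf("s3://%s/sessions/%s", os.Getenv("S3_BUCKET"), os.Getenv("SESSION_NAME"))
//		for {
//			// Push session state (.claude/, artifacts/, uploads/) to S3; the AWS CLI reads
//			// AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the container environment.
//			cmd := exec.Command("aws", "--endpoint-url", os.Getenv("S3_ENDPOINT"),
//				"s3", "sync", "/workspace/.claude", dst+"/.claude")
//			if err := cmd.Run(); err != nil {
//				log.Printf("state-sync: sync failed: %v", err)
//			}
//			time.Sleep(time.Duration(intervalSec) * time.Second)
//		}
//	}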
job.Spec.Template.Spec.Containers { - if job.Spec.Template.Spec.Containers[i].Name == "ambient-code-runner" { - job.Spec.Template.Spec.Containers[i].VolumeMounts = append(job.Spec.Template.Spec.Containers[i].VolumeMounts, corev1.VolumeMount{ + for i := range pod.Spec.Containers { + if pod.Spec.Containers[i].Name == "ambient-code-runner" { + pod.Spec.Containers[i].VolumeMounts = append(pod.Spec.Containers[i].VolumeMounts, corev1.VolumeMount{ Name: "vertex", MountPath: "/app/vertex", ReadOnly: true, @@ -1393,7 +1291,7 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }, OwnerReferences: []v1.OwnerReference{ { - APIVersion: "vteam.ambient-code/v1", + APIVersion: "vteam.ambient-code/v1alpha1", Kind: "AgenticSession", Name: currentObj.GetName(), UID: currentObj.GetUID(), @@ -1419,7 +1317,7 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { // Always mount Google OAuth secret (with Optional: true so pod starts even if empty) // K8s will sync updates when backend populates credentials after OAuth completion (~60s) - job.Spec.Template.Spec.Volumes = append(job.Spec.Template.Spec.Volumes, corev1.Volume{ + pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{ Name: "google-oauth", VolumeSource: corev1.VolumeSource{ Secret: &corev1.SecretVolumeSource{ @@ -1429,9 +1327,9 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }, }) // Mount to the ambient-code-runner container - for i := range job.Spec.Template.Spec.Containers { - if job.Spec.Template.Spec.Containers[i].Name == "ambient-code-runner" { - job.Spec.Template.Spec.Containers[i].VolumeMounts = append(job.Spec.Template.Spec.Containers[i].VolumeMounts, corev1.VolumeMount{ + for i := range pod.Spec.Containers { + if pod.Spec.Containers[i].Name == "ambient-code-runner" { + pod.Spec.Containers[i].VolumeMounts = append(pod.Spec.Containers[i].VolumeMounts, corev1.VolumeMount{ Name: "google-oauth", MountPath: "/app/.google_workspace_mcp/credentials", ReadOnly: true, @@ -1443,19 +1341,19 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { // Do not mount runner Secret volume; runner fetches tokens on demand - // Create the job - createdJob, err := config.K8sClient.BatchV1().Jobs(sessionNamespace).Create(context.TODO(), job, v1.CreateOptions{}) + // Create the pod + createdPod, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Create(context.TODO(), pod, v1.CreateOptions{}) if err != nil { - // If job already exists, this is likely a race condition from duplicate watch events - not an error + // If pod already exists, this is likely a race condition from duplicate watch events - not an error if errors.IsAlreadyExists(err) { - log.Printf("Job %s already exists (race condition), continuing", jobName) - // Clear desired-phase annotation since job exists + log.Printf("Pod %s already exists (race condition), continuing", podName) + // Clear desired-phase annotation since pod exists _ = clearAnnotation(sessionNamespace, name, "ambient-code.io/desired-phase") return nil } - log.Printf("Failed to create job %s: %v", jobName, err) + log.Printf("Failed to create pod %s: %v", podName, err) statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "False", Reason: "CreateFailed", Message: err.Error(), @@ -1463,54 +1361,54 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { statusPatch.AddCondition(conditionUpdate{ Type: conditionReady, Status: "False", - Reason: "JobCreationFailed", - Message: 
"Runner job creation failed", + Reason: "PodCreationFailed", + Message: "Runner pod creation failed", }) _ = statusPatch.Apply() - return fmt.Errorf("failed to create job: %v", err) + return fmt.Errorf("failed to create pod: %v", err) } - log.Printf("Created job %s for AgenticSession %s", jobName, name) + log.Printf("Created pod %s for AgenticSession %s", podName, name) statusPatch.SetField("phase", "Creating") statusPatch.SetField("observedGeneration", currentObj.GetGeneration()) statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "True", - Reason: "JobCreated", - Message: "Runner job created", + Reason: "PodCreated", + Message: "Runner pod created", }) // Apply all accumulated status changes in a single API call if err := statusPatch.Apply(); err != nil { log.Printf("Warning: failed to apply status patch: %v", err) } - // Clear desired-phase annotation now that job is created + // Clear desired-phase annotation now that pod is created // (This was deferred from the restart handler to avoid race conditions with stale events) _ = clearAnnotation(sessionNamespace, name, "ambient-code.io/desired-phase") - log.Printf("[DesiredPhase] Cleared desired-phase annotation after successful job creation") + log.Printf("[DesiredPhase] Cleared desired-phase annotation after successful pod creation") - // Create a per-job Service pointing to the content container + // Create a per-pod Service pointing to the content container svc := &corev1.Service{ ObjectMeta: v1.ObjectMeta{ Name: fmt.Sprintf("ambient-content-%s", name), Namespace: sessionNamespace, Labels: map[string]string{"app": "ambient-code-runner", "agentic-session": name}, OwnerReferences: []v1.OwnerReference{{ - APIVersion: "batch/v1", - Kind: "Job", - Name: jobName, - UID: createdJob.UID, + APIVersion: "v1", + Kind: "Pod", + Name: podName, + UID: createdPod.UID, Controller: boolPtr(true), }}, }, Spec: corev1.ServiceSpec{ - Selector: map[string]string{"job-name": jobName}, + Selector: map[string]string{"agentic-session": name, "app": "ambient-code-runner"}, Ports: []corev1.ServicePort{{Port: 8080, TargetPort: intstr.FromString("http"), Protocol: corev1.ProtocolTCP, Name: "http"}}, Type: corev1.ServiceTypeClusterIP, }, } if _, serr := config.K8sClient.CoreV1().Services(sessionNamespace).Create(context.TODO(), svc, v1.CreateOptions{}); serr != nil && !errors.IsAlreadyExists(serr) { - log.Printf("Failed to create per-job content service for %s: %v", name, serr) + log.Printf("Failed to create per-pod content service for %s: %v", name, serr) } // Create AG-UI Service pointing to the runner's FastAPI server @@ -1524,16 +1422,16 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { "agentic-session": name, }, OwnerReferences: []v1.OwnerReference{{ - APIVersion: "batch/v1", - Kind: "Job", - Name: jobName, - UID: createdJob.UID, + APIVersion: "v1", + Kind: "Pod", + Name: podName, + UID: createdPod.UID, Controller: boolPtr(true), }}, }, Spec: corev1.ServiceSpec{ Type: corev1.ServiceTypeClusterIP, - Selector: map[string]string{"job-name": jobName}, + Selector: map[string]string{"agentic-session": name, "app": "ambient-code-runner"}, Ports: []corev1.ServicePort{{ Name: "agui", Protocol: corev1.ProtocolTCP, @@ -1548,17 +1446,17 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { log.Printf("Created AG-UI service session-%s for AgenticSession %s", name, name) } - // Start monitoring the job (only if not already being monitored) - monitorKey := 
fmt.Sprintf("%s/%s", sessionNamespace, jobName) - monitoredJobsMu.Lock() - alreadyMonitoring := monitoredJobs[monitorKey] + // Start monitoring the pod (only if not already being monitored) + monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, podName) + monitoredPodsMu.Lock() + alreadyMonitoring := monitoredPods[monitorKey] if !alreadyMonitoring { - monitoredJobs[monitorKey] = true - monitoredJobsMu.Unlock() - go monitorJob(jobName, name, sessionNamespace) + monitoredPods[monitorKey] = true + monitoredPodsMu.Unlock() + go monitorPod(podName, name, sessionNamespace) } else { - monitoredJobsMu.Unlock() - log.Printf("Job %s already being monitored, skipping duplicate goroutine", jobName) + monitoredPodsMu.Unlock() + log.Printf("Pod %s already being monitored, skipping duplicate goroutine", podName) } return nil @@ -1834,18 +1732,18 @@ func reconcileActiveWorkflowWithPatch(sessionNamespace, sessionName string, spec return nil } -func monitorJob(jobName, sessionName, sessionNamespace string) { - monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, jobName) +func monitorPod(podName, sessionName, sessionNamespace string) { + monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, podName) // Remove from monitoring map when this goroutine exits defer func() { - monitoredJobsMu.Lock() - delete(monitoredJobs, monitorKey) - monitoredJobsMu.Unlock() - log.Printf("Stopped monitoring job %s (goroutine exiting)", jobName) + monitoredPodsMu.Lock() + delete(monitoredPods, monitorKey) + monitoredPodsMu.Unlock() + log.Printf("Stopped monitoring pod %s (goroutine exiting)", podName) }() - log.Printf("Starting job monitoring for %s (session: %s/%s)", jobName, sessionNamespace, sessionName) + log.Printf("Starting pod monitoring for %s (session: %s/%s)", podName, sessionNamespace, sessionName) ticker := time.NewTicker(5 * time.Second) defer ticker.Stop() @@ -1868,7 +1766,7 @@ func monitorJob(jobName, sessionName, sessionNamespace string) { sessionStatus, _, _ := unstructured.NestedMap(sessionObj.Object, "status") if sessionStatus != nil { if currentPhase, ok := sessionStatus["phase"].(string); ok && currentPhase == "Stopped" { - log.Printf("AgenticSession %s was stopped; stopping job monitoring", sessionName) + log.Printf("AgenticSession %s was stopped; stopping pod monitoring", sessionName) return } } @@ -1877,79 +1775,97 @@ func monitorJob(jobName, sessionName, sessionNamespace string) { log.Printf("Failed to refresh runner token for %s/%s: %v", sessionNamespace, sessionName, err) } - job, err := config.K8sClient.BatchV1().Jobs(sessionNamespace).Get(context.TODO(), jobName, v1.GetOptions{}) + pod, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), podName, v1.GetOptions{}) if err != nil { if errors.IsNotFound(err) { - log.Printf("Job %s deleted; stopping monitor", jobName) + log.Printf("Pod %s deleted; stopping monitor", podName) return } - log.Printf("Error fetching job %s: %v", jobName, err) + log.Printf("Error fetching pod %s: %v", podName, err) continue } + // Note: We don't store pod name in status (pods are ephemeral, can be recreated) + // Use k8s-resources endpoint or kubectl for live pod info - pods, err := config.K8sClient.CoreV1().Pods(sessionNamespace).List(context.TODO(), v1.ListOptions{LabelSelector: fmt.Sprintf("job-name=%s", jobName)}) - if err != nil { - log.Printf("Failed to list pods for job %s: %v", jobName, err) - continue + if pod.Spec.NodeName != "" { + statusPatch.AddCondition(conditionUpdate{Type: conditionPodScheduled, Status: "True", Reason: "Scheduled", 
Message: fmt.Sprintf("Scheduled on %s", pod.Spec.NodeName)}) } - if job.Status.Succeeded > 0 { + if pod.Status.Phase == corev1.PodSucceeded { statusPatch.SetField("phase", "Completed") statusPatch.SetField("completionTime", time.Now().UTC().Format(time.RFC3339)) statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: "Completed", Message: "Session finished"}) _ = statusPatch.Apply() _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) + _ = deletePodAndPerPodService(sessionNamespace, podName, sessionName) return } - if job.Spec.BackoffLimit != nil && job.Status.Failed >= *job.Spec.BackoffLimit { - statusPatch.SetField("phase", "Failed") - statusPatch.SetField("completionTime", time.Now().UTC().Format(time.RFC3339)) - statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: "BackoffLimitExceeded", Message: "Runner failed repeatedly"}) - _ = statusPatch.Apply() - _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) - return - } + if pod.Status.Phase == corev1.PodFailed { + // Collect detailed error message from pod and containers + errorMsg := pod.Status.Message + if errorMsg == "" { + errorMsg = pod.Status.Reason + } - if len(pods.Items) == 0 { - if job.Status.Active == 0 && job.Status.Succeeded == 0 && job.Status.Failed == 0 { - statusPatch.SetField("phase", "Failed") - statusPatch.SetField("completionTime", time.Now().UTC().Format(time.RFC3339)) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionReady, - Status: "False", - Reason: "PodMissing", - Message: "Runner pod missing", - }) - _ = statusPatch.Apply() - _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) - return + // Check init containers for errors + for _, initStatus := range pod.Status.InitContainerStatuses { + if initStatus.State.Terminated != nil && initStatus.State.Terminated.ExitCode != 0 { + msg := fmt.Sprintf("Init container %s failed (exit %d): %s", + initStatus.Name, + initStatus.State.Terminated.ExitCode, + initStatus.State.Terminated.Message) + if initStatus.State.Terminated.Reason != "" { + msg = fmt.Sprintf("%s - %s", msg, initStatus.State.Terminated.Reason) + } + errorMsg = msg + break + } + if initStatus.State.Waiting != nil && initStatus.State.Waiting.Reason != "" { + errorMsg = fmt.Sprintf("Init container %s: %s - %s", + initStatus.Name, + initStatus.State.Waiting.Reason, + initStatus.State.Waiting.Message) + break + } } - continue - } - pod := pods.Items[0] - // Note: We don't store pod name in status (pods are ephemeral, can be recreated) - // Use k8s-resources endpoint or kubectl for live pod info + // Check main containers for errors if init passed + if errorMsg == "" || errorMsg == "PodFailed" { + for _, containerStatus := range pod.Status.ContainerStatuses { + if containerStatus.State.Terminated != nil && containerStatus.State.Terminated.ExitCode != 0 { + errorMsg = fmt.Sprintf("Container %s failed (exit %d): %s - %s", + containerStatus.Name, + containerStatus.State.Terminated.ExitCode, + containerStatus.State.Terminated.Reason, + containerStatus.State.Terminated.Message) + break + } + if containerStatus.State.Waiting != nil { + errorMsg = fmt.Sprintf("Container %s: %s - %s", + containerStatus.Name, + containerStatus.State.Waiting.Reason, + containerStatus.State.Waiting.Message) + break + } + } + 
} - if pod.Spec.NodeName != "" { - statusPatch.AddCondition(conditionUpdate{Type: conditionPodScheduled, Status: "True", Reason: "Scheduled", Message: fmt.Sprintf("Scheduled on %s", pod.Spec.NodeName)}) - } + if errorMsg == "" { + errorMsg = "Pod failed with unknown error" + } - if pod.Status.Phase == corev1.PodFailed { + log.Printf("Pod %s failed: %s", podName, errorMsg) statusPatch.SetField("phase", "Failed") statusPatch.SetField("completionTime", time.Now().UTC().Format(time.RFC3339)) - statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: "PodFailed", Message: pod.Status.Message}) + statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: "PodFailed", Message: errorMsg}) _ = statusPatch.Apply() _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) + _ = deletePodAndPerPodService(sessionNamespace, podName, sessionName) return } - runner := getContainerStatusByName(&pod, "ambient-code-runner") + runner := getContainerStatusByName(pod, "ambient-code-runner") if runner == nil { // Apply any accumulated changes (e.g., PodScheduled) before continuing _ = statusPatch.Apply() @@ -1974,7 +1890,7 @@ func monitorJob(jobName, sessionName, sessionNamespace string) { statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: waiting.Reason, Message: msg}) _ = statusPatch.Apply() _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) + _ = deletePodAndPerPodService(sessionNamespace, podName, sessionName) return } } @@ -2008,7 +1924,7 @@ func monitorJob(jobName, sessionName, sessionNamespace string) { _ = statusPatch.Apply() _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) + _ = deletePodAndPerPodService(sessionNamespace, podName, sessionName) return } @@ -2027,31 +1943,101 @@ func getContainerStatusByName(pod *corev1.Pod, name string) *corev1.ContainerSta return nil } +// getS3ConfigForProject reads S3 configuration from project's integration secret +// Falls back to operator defaults if not configured +func getS3ConfigForProject(namespace string, appConfig *config.Config) (endpoint, bucket, accessKey, secretKey string, err error) { + // Try to read from project's ambient-non-vertex-integrations secret + secret, err := config.K8sClient.CoreV1().Secrets(namespace).Get(context.TODO(), "ambient-non-vertex-integrations", v1.GetOptions{}) + if err != nil && !errors.IsNotFound(err) { + return "", "", "", "", fmt.Errorf("failed to read project secret: %w", err) + } + + // Read from project secret if available + storageMode := "shared" // Default to shared cluster storage + if secret != nil && secret.Data != nil { + // Check storage mode (shared vs custom) + if mode := string(secret.Data["STORAGE_MODE"]); mode != "" { + storageMode = mode + } + + // Only read custom S3 settings if in custom mode + if storageMode == "custom" { + if val := string(secret.Data["S3_ENDPOINT"]); val != "" { + endpoint = val + } + if val := string(secret.Data["S3_BUCKET"]); val != "" { + bucket = val + } + if val := string(secret.Data["S3_ACCESS_KEY"]); val != "" { + accessKey = val + } + if val := string(secret.Data["S3_SECRET_KEY"]); val != "" { + secretKey = val + } + log.Printf("Using custom S3 configuration for project %s", namespace) + } else { + log.Printf("Using shared cluster storage (MinIO) 
for project %s", namespace) + } + } + + // Use operator defaults (for shared mode or as fallback) + if endpoint == "" { + endpoint = appConfig.S3Endpoint + } + if bucket == "" { + bucket = appConfig.S3Bucket + } + + // If credentials still empty AND using default endpoint/bucket, use shared MinIO credentials + // This implements "shared cluster storage" mode where users don't need to configure anything + usingDefaults := endpoint == appConfig.S3Endpoint && bucket == appConfig.S3Bucket + if (accessKey == "" || secretKey == "") && usingDefaults { + // Look for minio-credentials secret in operator namespace + minioSecret, err := config.K8sClient.CoreV1().Secrets(appConfig.BackendNamespace).Get(context.TODO(), "minio-credentials", v1.GetOptions{}) + if err == nil && minioSecret.Data != nil { + if accessKey == "" { + accessKey = string(minioSecret.Data["access-key"]) + } + if secretKey == "" { + secretKey = string(minioSecret.Data["secret-key"]) + } + log.Printf("Using shared MinIO credentials for project %s (shared cluster storage mode)", namespace) + } else { + log.Printf("Warning: minio-credentials secret not found in namespace %s", appConfig.BackendNamespace) + } + } + + // Validate we have required config + if endpoint == "" || bucket == "" { + return "", "", "", "", fmt.Errorf("incomplete S3 configuration - endpoint and bucket required") + } + if accessKey == "" || secretKey == "" { + return "", "", "", "", fmt.Errorf("incomplete S3 configuration - access key and secret key required") + } + + log.Printf("S3 config for project %s: endpoint=%s, bucket=%s", namespace, endpoint, bucket) + return endpoint, bucket, accessKey, secretKey, nil +} + // deleteJobAndPerJobService deletes the Job and its associated per-job Service -func deleteJobAndPerJobService(namespace, jobName, sessionName string) error { - // Delete Service first (it has ownerRef to Job, but delete explicitly just in case) +func deletePodAndPerPodService(namespace, podName, sessionName string) error { + // Delete Service first (it has ownerRef to Pod, but delete explicitly just in case) svcName := fmt.Sprintf("ambient-content-%s", sessionName) if err := config.K8sClient.CoreV1().Services(namespace).Delete(context.TODO(), svcName, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete per-job service %s/%s: %v", namespace, svcName, err) + log.Printf("Failed to delete per-pod service %s/%s: %v", namespace, svcName, err) } - // Delete the Job with background propagation - policy := v1.DeletePropagationBackground - if err := config.K8sClient.BatchV1().Jobs(namespace).Delete(context.TODO(), jobName, v1.DeleteOptions{PropagationPolicy: &policy}); err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete job %s/%s: %v", namespace, jobName, err) - return err + // Delete AG-UI service + aguiSvcName := fmt.Sprintf("session-%s", sessionName) + if err := config.K8sClient.CoreV1().Services(namespace).Delete(context.TODO(), aguiSvcName, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { + log.Printf("Failed to delete AG-UI service %s/%s: %v", namespace, aguiSvcName, err) } - // Proactively delete Pods for this Job - if pods, err := config.K8sClient.CoreV1().Pods(namespace).List(context.TODO(), v1.ListOptions{LabelSelector: fmt.Sprintf("job-name=%s", jobName)}); err == nil { - for i := range pods.Items { - p := pods.Items[i] - if err := config.K8sClient.CoreV1().Pods(namespace).Delete(context.TODO(), p.Name, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { - 
log.Printf("Failed to delete pod %s/%s for job %s: %v", namespace, p.Name, jobName, err) - } - } - } else if !errors.IsNotFound(err) { - log.Printf("Failed to list pods for job %s/%s: %v", namespace, jobName, err) + // Delete the Pod with background propagation + policy := v1.DeletePropagationBackground + if err := config.K8sClient.CoreV1().Pods(namespace).Delete(context.TODO(), podName, v1.DeleteOptions{PropagationPolicy: &policy}); err != nil && !errors.IsNotFound(err) { + log.Printf("Failed to delete pod %s/%s: %v", namespace, podName, err) + return err } // Delete the ambient-vertex secret if it was copied by the operator @@ -2076,90 +2062,6 @@ func deleteJobAndPerJobService(namespace, jobName, sessionName string) error { return nil } -// CleanupExpiredTempContentPods removes temporary content pods that have exceeded their TTL -func CleanupExpiredTempContentPods() { - log.Println("Starting temp content pod cleanup goroutine") - for { - time.Sleep(1 * time.Minute) - - // List all temp content pods across all namespaces - pods, err := config.K8sClient.CoreV1().Pods("").List(context.TODO(), v1.ListOptions{ - LabelSelector: "app=temp-content-service", - }) - if err != nil { - log.Printf("[TempPodCleanup] Failed to list temp content pods: %v", err) - continue - } - - gvr := types.GetAgenticSessionResource() - for _, pod := range pods.Items { - sessionName := pod.Labels["agentic-session"] - if sessionName == "" { - log.Printf("[TempPodCleanup] Temp pod %s has no agentic-session label, skipping", pod.Name) - continue - } - - // Check if session still exists - session, err := config.DynamicClient.Resource(gvr).Namespace(pod.Namespace).Get(context.TODO(), sessionName, v1.GetOptions{}) - if err != nil { - if errors.IsNotFound(err) { - // Session deleted, delete temp pod - log.Printf("[TempPodCleanup] Session %s/%s gone, deleting orphaned temp pod %s", pod.Namespace, sessionName, pod.Name) - if err := config.K8sClient.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { - log.Printf("[TempPodCleanup] Failed to delete orphaned temp pod: %v", err) - } - } - continue - } - - // Get last-accessed timestamp from session annotation - annotations := session.GetAnnotations() - lastAccessedStr := annotations[tempContentLastAccessedAnnotation] - if lastAccessedStr == "" { - // Fall back to pod created-at if no last-accessed - lastAccessedStr = pod.Annotations["ambient-code.io/created-at"] - } - - if lastAccessedStr == "" { - log.Printf("[TempPodCleanup] No timestamp for temp pod %s, skipping", pod.Name) - continue - } - - lastAccessed, err := time.Parse(time.RFC3339, lastAccessedStr) - if err != nil { - log.Printf("[TempPodCleanup] Failed to parse timestamp for pod %s: %v", pod.Name, err) - continue - } - - // Delete if inactive for > 10 minutes - if time.Since(lastAccessed) > tempContentInactivityTTL { - log.Printf("[TempPodCleanup] Deleting inactive temp pod %s/%s (last accessed: %v ago)", - pod.Namespace, pod.Name, time.Since(lastAccessed)) - - if err := config.K8sClient.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { - log.Printf("[TempPodCleanup] Failed to delete temp pod: %v", err) - continue - } - - // Update condition - _ = mutateAgenticSessionStatus(pod.Namespace, sessionName, func(status map[string]interface{}) { - setCondition(status, conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "False", - Reason: "Expired", - Message: 
fmt.Sprintf("Temp pod deleted due to inactivity (%v)", time.Since(lastAccessed)), - }) - }) - - // Clear temp-content-requested annotation - delete(annotations, tempContentRequestedAnnotation) - delete(annotations, tempContentLastAccessedAnnotation) - _ = updateAnnotations(pod.Namespace, sessionName, annotations) - } - } - } -} - // copySecretToNamespace copies a secret to a target namespace with owner references func copySecretToNamespace(ctx context.Context, sourceSecret *corev1.Secret, targetNamespace string, ownerObj *unstructured.Unstructured) error { // Check if secret already exists in target namespace @@ -2326,137 +2228,6 @@ func deleteAmbientLangfuseSecret(ctx context.Context, namespace string) error { return nil } -// reconcileTempContentPodWithPatch is a version of reconcileTempContentPod that uses StatusPatch for batched updates. -func reconcileTempContentPodWithPatch(sessionNamespace, sessionName, tempPodName string, session *unstructured.Unstructured, statusPatch *StatusPatch) error { - // Check if pod already exists - tempPod, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), tempPodName, v1.GetOptions{}) - - if errors.IsNotFound(err) { - // Create temp pod - log.Printf("[TempPod] Creating temp content pod for workspace access: %s/%s", sessionNamespace, tempPodName) - - pvcName := fmt.Sprintf("ambient-workspace-%s", sessionName) - appConfig := config.LoadConfig() - - pod := &corev1.Pod{ - ObjectMeta: v1.ObjectMeta{ - Name: tempPodName, - Namespace: sessionNamespace, - Labels: map[string]string{ - "app": "temp-content-service", - "agentic-session": sessionName, - }, - Annotations: map[string]string{ - "ambient-code.io/created-at": time.Now().UTC().Format(time.RFC3339), - }, - OwnerReferences: []v1.OwnerReference{{ - APIVersion: session.GetAPIVersion(), - Kind: session.GetKind(), - Name: session.GetName(), - UID: session.GetUID(), - Controller: boolPtr(true), - }}, - }, - Spec: corev1.PodSpec{ - RestartPolicy: corev1.RestartPolicyNever, - TerminationGracePeriodSeconds: int64Ptr(0), // Enable instant termination - Containers: []corev1.Container{{ - Name: "content", - Image: appConfig.ContentServiceImage, - ImagePullPolicy: appConfig.ImagePullPolicy, - Env: []corev1.EnvVar{ - {Name: "CONTENT_SERVICE_MODE", Value: "true"}, - {Name: "STATE_BASE_DIR", Value: "/workspace"}, - }, - Ports: []corev1.ContainerPort{{ContainerPort: 8080, Name: "http"}}, - VolumeMounts: []corev1.VolumeMount{{ - Name: "workspace", - MountPath: "/workspace", - }}, - ReadinessProbe: &corev1.Probe{ - ProbeHandler: corev1.ProbeHandler{ - HTTPGet: &corev1.HTTPGetAction{ - Path: "/health", - Port: intstr.FromString("http"), - }, - }, - InitialDelaySeconds: 3, - PeriodSeconds: 3, - }, - }}, - Volumes: []corev1.Volume{{ - Name: "workspace", - VolumeSource: corev1.VolumeSource{ - PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ - ClaimName: pvcName, - }, - }, - }}, - }, - } - - if _, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Create(context.TODO(), pod, v1.CreateOptions{}); err != nil { - log.Printf("[TempPod] Failed to create temp pod: %v", err) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "False", - Reason: "CreationFailed", - Message: fmt.Sprintf("Failed to create temp pod: %v", err), - }) - return fmt.Errorf("failed to create temp pod: %w", err) - } - - log.Printf("[TempPod] Created temp pod %s", tempPodName) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "Unknown", 
- Reason: "Provisioning", - Message: "Temp content pod starting", - }) - return nil - } - - if err != nil { - return fmt.Errorf("failed to check temp pod: %w", err) - } - - // Temp pod exists, check readiness - if tempPod.Status.Phase == corev1.PodRunning { - ready := false - for _, cond := range tempPod.Status.Conditions { - if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue { - ready = true - break - } - } - - if ready { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "True", - Reason: "Ready", - Message: "Temp content pod is ready for workspace access", - }) - } else { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "Unknown", - Reason: "NotReady", - Message: "Temp content pod not ready yet", - }) - } - } else if tempPod.Status.Phase == corev1.PodFailed { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "False", - Reason: "PodFailed", - Message: fmt.Sprintf("Temp content pod failed: %s", tempPod.Status.Message), - }) - } - - return nil -} - // LEGACY: getBackendAPIURL removed - AG-UI migration // Workflow and repo changes now call runner's REST endpoints directly @@ -2632,6 +2403,5 @@ func regenerateRunnerToken(sessionNamespace, sessionName string, session *unstru // Helper functions var ( boolPtr = func(b bool) *bool { return &b } - int32Ptr = func(i int32) *int32 { return &i } int64Ptr = func(i int64) *int64 { return &i } ) diff --git a/components/operator/internal/services/infrastructure.go b/components/operator/internal/services/infrastructure.go index bed30920a..e33481f89 100644 --- a/components/operator/internal/services/infrastructure.go +++ b/components/operator/internal/services/infrastructure.go @@ -51,36 +51,10 @@ func EnsureContentService(namespace string) error { return nil } -// EnsureSessionWorkspacePVC creates a per-session PVC owned by the AgenticSession to avoid multi-attach conflicts +// EnsureSessionWorkspacePVC is deprecated - sessions now use EmptyDir with S3 state persistence +// Kept for backward compatibility but returns nil immediately func EnsureSessionWorkspacePVC(namespace, pvcName string, ownerRefs []v1.OwnerReference) error { - // Check if PVC exists - if _, err := config.K8sClient.CoreV1().PersistentVolumeClaims(namespace).Get(context.TODO(), pvcName, v1.GetOptions{}); err == nil { - return nil - } else if !errors.IsNotFound(err) { - return err - } - - pvc := &corev1.PersistentVolumeClaim{ - ObjectMeta: v1.ObjectMeta{ - Name: pvcName, - Namespace: namespace, - Labels: map[string]string{"app": "ambient-workspace", "agentic-session": pvcName}, - OwnerReferences: ownerRefs, - }, - Spec: corev1.PersistentVolumeClaimSpec{ - AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce}, - Resources: corev1.VolumeResourceRequirements{ - Requests: corev1.ResourceList{ - corev1.ResourceStorage: resource.MustParse("5Gi"), - }, - }, - }, - } - if _, err := config.K8sClient.CoreV1().PersistentVolumeClaims(namespace).Create(context.TODO(), pvc, v1.CreateOptions{}); err != nil { - if errors.IsAlreadyExists(err) { - return nil - } - return err - } + // DEPRECATED: Per-session PVCs have been replaced with EmptyDir + S3 state sync + // This function is kept for backward compatibility but does nothing return nil } diff --git a/components/operator/main.go b/components/operator/main.go index df9c31821..c71c12709 100644 --- a/components/operator/main.go +++ b/components/operator/main.go @@ -1,16 +1,28 @@ package main 
import ( + "context" + "flag" "log" "os" + "strconv" + + "k8s.io/apimachinery/pkg/runtime" + utilruntime "k8s.io/apimachinery/pkg/util/runtime" + clientgoscheme "k8s.io/client-go/kubernetes/scheme" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/healthz" + ctrllog "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/controller-runtime/pkg/log/zap" + metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server" "ambient-code-operator/internal/config" + "ambient-code-operator/internal/controller" "ambient-code-operator/internal/handlers" "ambient-code-operator/internal/preflight" ) // Build-time metadata (set via -ldflags -X during build) -// These are embedded directly in the binary, so they're always accurate var ( GitCommit = "unknown" GitBranch = "unknown" @@ -18,49 +30,157 @@ var ( BuildDate = "unknown" ) -func logBuildInfo() { - log.Println("==============================================") - log.Println("Agentic Session Operator - Build Information") - log.Println("==============================================") - log.Printf("Version: %s", GitVersion) - log.Printf("Commit: %s", GitCommit) - log.Printf("Branch: %s", GitBranch) - log.Printf("Repository: %s", getEnvOrDefault("GIT_REPO", "unknown")) - log.Printf("Built: %s", BuildDate) - log.Printf("Built by: %s", getEnvOrDefault("BUILD_USER", "unknown")) - log.Println("==============================================") -} +var ( + scheme = runtime.NewScheme() +) -func getEnvOrDefault(key, defaultValue string) string { - if value := os.Getenv(key); value != "" { - return value - } - return defaultValue +func init() { + utilruntime.Must(clientgoscheme.AddToScheme(scheme)) } func main() { + // Parse command line flags + var metricsAddr string + var enableLeaderElection bool + var probeAddr string + var maxConcurrentReconciles int + var useLegacyWatch bool + + flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.") + flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.") + flag.BoolVar(&enableLeaderElection, "leader-elect", false, + "Enable leader election for controller manager. "+ + "Enabling this will ensure there is only one active controller manager.") + flag.IntVar(&maxConcurrentReconciles, "max-concurrent-reconciles", 10, + "Maximum number of concurrent Reconciles which can be run. 
Higher values allow more throughput but consume more resources.") + flag.BoolVar(&useLegacyWatch, "legacy-watch", false, + "Use legacy watch-based implementation instead of controller-runtime (for debugging only).") + flag.Parse() + + // Allow environment variable override for max concurrent reconciles + if envVal := os.Getenv("MAX_CONCURRENT_RECONCILES"); envVal != "" { + if v, err := strconv.Atoi(envVal); err == nil && v > 0 { + maxConcurrentReconciles = v + } + } + + // Set up logging + opts := zap.Options{ + Development: os.Getenv("DEV_MODE") == "true", + } + ctrllog.SetLogger(zap.New(zap.UseFlagOptions(&opts))) + + logger := ctrllog.Log.WithName("setup") + // Log build information logBuildInfo() + logger.Info("Starting Agentic Session Operator", + "maxConcurrentReconciles", maxConcurrentReconciles, + "leaderElection", enableLeaderElection, + "legacyWatch", useLegacyWatch, + ) - // Initialize Kubernetes clients + // Initialize Kubernetes clients (needed for legacy handlers and config) if err := config.InitK8sClients(); err != nil { - log.Fatalf("Failed to initialize Kubernetes clients: %v", err) + logger.Error(err, "Failed to initialize Kubernetes clients") + os.Exit(1) } // Load application configuration appConfig := config.LoadConfig() - log.Printf("Agentic Session Operator starting in namespace: %s", appConfig.Namespace) - log.Printf("Using ambient-code runner image: %s", appConfig.AmbientCodeRunnerImage) + logger.Info("Configuration loaded", + "namespace", appConfig.Namespace, + "backendNamespace", appConfig.BackendNamespace, + "runnerImage", appConfig.AmbientCodeRunnerImage, + ) + + // Initialize OpenTelemetry metrics + shutdownMetrics, err := controller.InitMetrics(context.Background()) + if err != nil { + logger.Error(err, "Failed to initialize OpenTelemetry metrics, continuing without metrics") + } else { + defer shutdownMetrics() + } // Validate Vertex AI configuration at startup if enabled if os.Getenv("CLAUDE_CODE_USE_VERTEX") == "1" { if err := preflight.ValidateVertexConfig(appConfig.Namespace); err != nil { - log.Fatalf("Vertex AI validation failed: %v", err) + logger.Error(err, "Vertex AI validation failed") + os.Exit(1) } } - // Start watching AgenticSession resources + // If legacy watch mode is requested, use the old implementation + if useLegacyWatch { + logger.Info("Using legacy watch-based implementation") + runLegacyMode() + return + } + + // Create controller-runtime manager with increased QPS/Burst to avoid client-side throttling + // Default is QPS=5, Burst=10 which causes delays when handling multiple sessions + restConfig := ctrl.GetConfigOrDie() + restConfig.QPS = 100 + restConfig.Burst = 200 + + mgr, err := ctrl.NewManager(restConfig, ctrl.Options{ + Scheme: scheme, + Metrics: metricsserver.Options{BindAddress: metricsAddr}, + HealthProbeBindAddress: probeAddr, + LeaderElection: enableLeaderElection, + LeaderElectionID: "ambient-code-operator.ambient-code.io", + }) + if err != nil { + logger.Error(err, "Unable to create manager") + os.Exit(1) + } + + // Set up AgenticSession controller with concurrent reconcilers + agenticSessionReconciler := controller.NewAgenticSessionReconciler( + mgr.GetClient(), + maxConcurrentReconciles, + ) + if err := agenticSessionReconciler.SetupWithManager(mgr); err != nil { + logger.Error(err, "Unable to create AgenticSession controller") + os.Exit(1) + } + logger.Info("AgenticSession controller registered", + "maxConcurrentReconciles", maxConcurrentReconciles, + ) + + // Add health check endpoints + if err := 
mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil { + logger.Error(err, "Unable to set up health check") + os.Exit(1) + } + if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil { + logger.Error(err, "Unable to set up ready check") + os.Exit(1) + } + + // Start namespace and project settings watchers (these remain as watch loops for now) + // Note: These could be migrated to controller-runtime controllers in the future + go handlers.WatchNamespaces() + go handlers.WatchProjectSettings() + + logger.Info("Starting manager with controller-runtime", + "maxConcurrentReconciles", maxConcurrentReconciles, + ) + + // Start the manager (blocks until stopped) + if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil { + logger.Error(err, "Problem running manager") + os.Exit(1) + } +} + +// runLegacyMode runs the operator using the old watch-based implementation. +// This is kept for backward compatibility and debugging. +func runLegacyMode() { + log.Println("=== LEGACY MODE: Using watch-based implementation ===") + + // Start watching AgenticSession resources (legacy) go handlers.WatchAgenticSessions() // Start watching for managed namespaces @@ -69,9 +189,26 @@ func main() { // Start watching ProjectSettings resources go handlers.WatchProjectSettings() - // Start cleanup of expired temporary content pods - go handlers.CleanupExpiredTempContentPods() - // Keep the operator running select {} } + +func logBuildInfo() { + log.Println("==============================================") + log.Println("Agentic Session Operator - Build Information") + log.Println("==============================================") + log.Printf("Version: %s", GitVersion) + log.Printf("Commit: %s", GitCommit) + log.Printf("Branch: %s", GitBranch) + log.Printf("Repository: %s", getEnvOrDefault("GIT_REPO", "unknown")) + log.Printf("Built: %s", BuildDate) + log.Printf("Built by: %s", getEnvOrDefault("BUILD_USER", "unknown")) + log.Println("==============================================") +} + +func getEnvOrDefault(key, defaultValue string) string { + if value := os.Getenv(key); value != "" { + return value + } + return defaultValue +} diff --git a/components/runners/claude-code-runner/adapter.py b/components/runners/claude-code-runner/adapter.py index 419e493d2..ad8002f23 100644 --- a/components/runners/claude-code-runner/adapter.py +++ b/components/runners/claude-code-runner/adapter.py @@ -87,13 +87,11 @@ async def initialize(self, context: RunnerContext): # Copy Google OAuth credentials from mounted Secret to writable workspace location await self._setup_google_credentials() - # Prepare workspace from input repo if provided - async for event in self._prepare_workspace(): - yield event - - # Initialize workflow if ACTIVE_WORKFLOW env vars are set - async for event in self._initialize_workflow_if_set(): - yield event + # Workspace is already prepared by init container (hydrate.sh) + # - Repos cloned to /workspace/repos/ + # - Workflows cloned to /workspace/workflows/ + # - State hydrated from S3 to .claude/, artifacts/, file-uploads/ + logger.info("Workspace prepared by init container, validating...") # Validate prerequisite files exist for phase-based commands try: @@ -361,9 +359,11 @@ async def _run_claude_agent_sdk( ) obs._pending_initial_prompt = prompt - # Check if continuing from previous session - parent_session_id = self.context.get_env('PARENT_SESSION_ID', '').strip() - is_continuation = bool(parent_session_id) + # Check if this is a resume session via IS_RESUME env var + # This is set by the operator 
when restarting a stopped/completed/failed session + is_continuation = self.context.get_env('IS_RESUME', '').strip().lower() == 'true' + if is_continuation: + logger.info("IS_RESUME=true - treating as continuation") # Determine cwd and additional dirs repos_cfg = self._get_repos_config() @@ -898,160 +898,34 @@ async def _setup_vertex_credentials(self) -> dict: } async def _prepare_workspace(self) -> AsyncIterator[BaseEvent]: - """Clone input repo/branch into workspace and configure git remotes.""" + """Validate workspace prepared by init container. + + The init-hydrate container now handles: + - Downloading state from S3 (.claude/, artifacts/, file-uploads/) + - Cloning repos to /workspace/repos/ + - Cloning workflows to /workspace/workflows/ + + Runner just validates and logs what's ready. + """ workspace = Path(self.context.workspace_path) - workspace.mkdir(parents=True, exist_ok=True) - - parent_session_id = self.context.get_env('PARENT_SESSION_ID', '').strip() - reusing_workspace = bool(parent_session_id) - - logger.info(f"Workspace preparation: parent_session_id={parent_session_id[:8] if parent_session_id else 'None'}, reusing={reusing_workspace}") - - repos_cfg = self._get_repos_config() - if repos_cfg: - async for event in self._prepare_multi_repo_workspace(workspace, repos_cfg, reusing_workspace): - yield event - return - - # Single-repo legacy flow - input_repo = os.getenv("INPUT_REPO_URL", "").strip() - if not input_repo: - logger.info("No INPUT_REPO_URL configured, skipping single-repo setup") - return - - input_branch = os.getenv("INPUT_BRANCH", "").strip() or "main" - output_repo = os.getenv("OUTPUT_REPO_URL", "").strip() - - token = await self._fetch_token_for_url(input_repo) - workspace_has_git = (workspace / ".git").exists() - - try: - if not workspace_has_git: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": "📥 Cloning input repository..."} - ) - clone_url = self._url_with_token(input_repo, token) if token else input_repo - await self._run_cmd(["git", "clone", "--branch", input_branch, "--single-branch", clone_url, str(workspace)], cwd=str(workspace.parent)) - await self._run_cmd(["git", "remote", "set-url", "origin", clone_url], cwd=str(workspace), ignore_errors=True) - elif reusing_workspace: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": "✓ Preserving workspace (continuation)"} - ) - await self._run_cmd(["git", "remote", "set-url", "origin", self._url_with_token(input_repo, token) if token else input_repo], cwd=str(workspace), ignore_errors=True) - else: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": "🔄 Resetting workspace to clean state"} - ) - await self._run_cmd(["git", "remote", "set-url", "origin", self._url_with_token(input_repo, token) if token else input_repo], cwd=str(workspace)) - await self._run_cmd(["git", "fetch", "origin", input_branch], cwd=str(workspace)) - await self._run_cmd(["git", "checkout", input_branch], cwd=str(workspace)) - await self._run_cmd(["git", "reset", "--hard", f"origin/{input_branch}"], cwd=str(workspace)) - - # Git identity - user_name = os.getenv("GIT_USER_NAME", "").strip() or "Ambient Code Bot" - user_email = 
os.getenv("GIT_USER_EMAIL", "").strip() or "bot@ambient-code.local" - await self._run_cmd(["git", "config", "user.name", user_name], cwd=str(workspace)) - await self._run_cmd(["git", "config", "user.email", user_email], cwd=str(workspace)) - - if output_repo: - out_url = self._url_with_token(output_repo, token) if token else output_repo - await self._run_cmd(["git", "remote", "remove", "output"], cwd=str(workspace), ignore_errors=True) - await self._run_cmd(["git", "remote", "add", "output", out_url], cwd=str(workspace)) - - except Exception as e: - logger.error(f"Failed to prepare workspace: {e}") - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"Workspace preparation failed: {e}"} - ) - - # Create artifacts directory - try: - artifacts_dir = workspace / "artifacts" - artifacts_dir.mkdir(parents=True, exist_ok=True) - except Exception as e: - logger.warning(f"Failed to create artifacts directory: {e}") - - async def _prepare_multi_repo_workspace( - self, workspace: Path, repos_cfg: list, reusing_workspace: bool - ) -> AsyncIterator[BaseEvent]: - """Prepare workspace for multi-repo mode.""" - try: - for r in repos_cfg: - name = (r.get('name') or '').strip() - inp = r.get('input') or {} - url = (inp.get('url') or '').strip() - branch = (inp.get('branch') or '').strip() or 'main' - if not name or not url: - continue - - repo_dir = workspace / name - token = await self._fetch_token_for_url(url) - repo_exists = repo_dir.exists() and (repo_dir / ".git").exists() - - if not repo_exists: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"📥 Cloning {name}..."} - ) - clone_url = self._url_with_token(url, token) if token else url - await self._run_cmd(["git", "clone", "--branch", branch, "--single-branch", clone_url, str(repo_dir)], cwd=str(workspace)) - await self._run_cmd(["git", "remote", "set-url", "origin", clone_url], cwd=str(repo_dir), ignore_errors=True) - elif reusing_workspace: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"✓ Preserving {name} (continuation)"} - ) - await self._run_cmd(["git", "remote", "set-url", "origin", self._url_with_token(url, token) if token else url], cwd=str(repo_dir), ignore_errors=True) - else: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"🔄 Resetting {name} to clean state"} - ) - await self._run_cmd(["git", "remote", "set-url", "origin", self._url_with_token(url, token) if token else url], cwd=str(repo_dir), ignore_errors=True) - await self._run_cmd(["git", "fetch", "origin", branch], cwd=str(repo_dir)) - await self._run_cmd(["git", "checkout", branch], cwd=str(repo_dir)) - await self._run_cmd(["git", "reset", "--hard", f"origin/{branch}"], cwd=str(repo_dir)) - - # Git identity - user_name = os.getenv("GIT_USER_NAME", "").strip() or "Ambient Code Bot" - user_email = os.getenv("GIT_USER_EMAIL", "").strip() or "bot@ambient-code.local" - await self._run_cmd(["git", "config", "user.name", user_name], cwd=str(repo_dir)) - await self._run_cmd(["git", "config", "user.email", user_email], cwd=str(repo_dir)) - - # 
Configure output remote - out = r.get('output') or {} - out_url_raw = (out.get('url') or '').strip() - if out_url_raw: - out_url = self._url_with_token(out_url_raw, token) if token else out_url_raw - await self._run_cmd(["git", "remote", "remove", "output"], cwd=str(repo_dir), ignore_errors=True) - await self._run_cmd(["git", "remote", "add", "output", out_url], cwd=str(repo_dir)) + logger.info(f"Validating workspace at {workspace}") + + # Check what was hydrated + hydrated_paths = [] + for path_name in [".claude", "artifacts", "file-uploads"]: + path_dir = workspace / path_name + if path_dir.exists(): + file_count = len([f for f in path_dir.rglob("*") if f.is_file()]) + if file_count > 0: + hydrated_paths.append(f"{path_name} ({file_count} files)") + + if hydrated_paths: + logger.info(f"Hydrated from S3: {', '.join(hydrated_paths)}") + else: + logger.info("No state hydrated (fresh session)") + + # No further preparation needed - init container did the work - except Exception as e: - logger.error(f"Failed to prepare multi-repo workspace: {e}") - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"Workspace preparation failed: {e}"} - ) async def _validate_prerequisites(self): """Validate prerequisite files exist for phase-based slash commands.""" @@ -1086,14 +960,11 @@ async def _validate_prerequisites(self): break async def _initialize_workflow_if_set(self) -> AsyncIterator[BaseEvent]: - """Initialize workflow on startup if ACTIVE_WORKFLOW env vars are set.""" + """Validate workflow was cloned by init container.""" active_workflow_url = (os.getenv('ACTIVE_WORKFLOW_GIT_URL') or '').strip() if not active_workflow_url: return - active_workflow_branch = (os.getenv('ACTIVE_WORKFLOW_BRANCH') or 'main').strip() - active_workflow_path = (os.getenv('ACTIVE_WORKFLOW_PATH') or '').strip() - try: owner, repo, _ = self._parse_owner_repo(active_workflow_url) derived_name = repo or '' @@ -1105,79 +976,24 @@ async def _initialize_workflow_if_set(self) -> AsyncIterator[BaseEvent]: derived_name = (derived_name or '').removesuffix('.git').strip() if not derived_name: - logger.warning("Could not derive workflow name from URL, skipping initialization") + logger.warning("Could not derive workflow name from URL") return - workflow_dir = Path(self.context.workspace_path) / "workflows" / derived_name - - if workflow_dir.exists(): - logger.info(f"Workflow {derived_name} already exists, skipping initialization") - return - - logger.info(f"Initializing workflow {derived_name} from CR spec on startup") - async for event in self._clone_workflow_repository(active_workflow_url, active_workflow_branch, active_workflow_path, derived_name): - yield event + # Check for cloned workflow (init container uses -clone-temp suffix) + workspace = Path(self.context.workspace_path) + workflow_temp_dir = workspace / "workflows" / f"{derived_name}-clone-temp" + workflow_dir = workspace / "workflows" / derived_name + + if workflow_temp_dir.exists(): + logger.info(f"Workflow {derived_name} cloned by init container at {workflow_temp_dir.name}") + elif workflow_dir.exists(): + logger.info(f"Workflow {derived_name} available at {workflow_dir.name}") + else: + logger.warning(f"Workflow {derived_name} not found (init container may have failed to clone)") except Exception as e: - logger.error(f"Failed to initialize workflow on startup: {e}") + logger.error(f"Failed to validate workflow: {e}") - async def 
_clone_workflow_repository( - self, git_url: str, branch: str, path: str, workflow_name: str - ) -> AsyncIterator[BaseEvent]: - """Clone workflow repository.""" - workspace = Path(self.context.workspace_path) - workflow_dir = workspace / "workflows" / workflow_name - temp_clone_dir = workspace / "workflows" / f"{workflow_name}-clone-temp" - - if workflow_dir.exists(): - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"✓ Workflow {workflow_name} already loaded"} - ) - return - - token = await self._fetch_token_for_url(git_url) - - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"📥 Cloning workflow {workflow_name}..."} - ) - - clone_url = self._url_with_token(git_url, token) if token else git_url - await self._run_cmd(["git", "clone", "--branch", branch, "--single-branch", clone_url, str(temp_clone_dir)], cwd=str(workspace)) - - if path and path.strip(): - subdir_path = temp_clone_dir / path.strip() - if subdir_path.exists() and subdir_path.is_dir(): - shutil.copytree(subdir_path, workflow_dir) - shutil.rmtree(temp_clone_dir) - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"✓ Extracted workflow from: {path}"} - ) - else: - temp_clone_dir.rename(workflow_dir) - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"⚠️ Path '{path}' not found, using full repository"} - ) - else: - temp_clone_dir.rename(workflow_dir) - - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"✅ Workflow {workflow_name} ready"} - ) async def _run_cmd(self, cmd, cwd=None, capture_stdout=False, ignore_errors=False): """Run a subprocess command asynchronously.""" diff --git a/components/runners/claude-code-runner/main.py b/components/runners/claude-code-runner/main.py index 7f14b1663..afbbefaed 100644 --- a/components/runners/claude-code-runner/main.py +++ b/components/runners/claude-code-runner/main.py @@ -97,17 +97,20 @@ async def lifespan(app: FastAPI): logger.info("Adapter initialized - fresh client will be created for each run") - # Check if this is a continuation (has parent session) - # PARENT_SESSION_ID is set when continuing from another session - parent_session_id = os.getenv("PARENT_SESSION_ID", "").strip() + # Check if this is a resume session via IS_RESUME env var + # This is set by the operator when restarting a stopped/completed/failed session + is_resume = os.getenv("IS_RESUME", "").strip().lower() == "true" + if is_resume: + logger.info("IS_RESUME=true - this is a resumed session, will skip INITIAL_PROMPT") - # Check for INITIAL_PROMPT and auto-execute (only if no parent session) + # Check for INITIAL_PROMPT and auto-execute (only if not a resume) initial_prompt = os.getenv("INITIAL_PROMPT", "").strip() - if initial_prompt and not parent_session_id: - logger.info(f"INITIAL_PROMPT detected ({len(initial_prompt)} chars), will auto-execute after 3s delay") + if initial_prompt and not is_resume: + delay = 
os.getenv("INITIAL_PROMPT_DELAY_SECONDS", "1") + logger.info(f"INITIAL_PROMPT detected ({len(initial_prompt)} chars), will auto-execute after {delay}s delay") asyncio.create_task(auto_execute_initial_prompt(initial_prompt, session_id)) - elif initial_prompt: - logger.info(f"INITIAL_PROMPT detected but has parent session ({parent_session_id[:12]}...) - skipping") + elif initial_prompt and is_resume: + logger.info("INITIAL_PROMPT detected but IS_RESUME=true - skipping (this is a resume)") logger.info(f"AG-UI server ready for session {session_id}") @@ -120,17 +123,19 @@ async def lifespan(app: FastAPI): async def auto_execute_initial_prompt(prompt: str, session_id: str): """Auto-execute INITIAL_PROMPT by POSTing to backend after short delay. - The 3-second delay gives the runner time to fully start. Backend has retry - logic to handle if Service DNS isn't ready yet. + The delay gives the runner service time to register in DNS. Backend has retry + logic to handle if Service DNS isn't ready yet, so this can be short. - Only called for fresh sessions (no PARENT_SESSION_ID set). + Only called for fresh sessions (no hydrated state in .claude/). """ import uuid import aiohttp - # Give runner time to fully start before backend tries to reach us - logger.info("Waiting 3s before auto-executing INITIAL_PROMPT (allow Service DNS to propagate)...") - await asyncio.sleep(3) + # Configurable delay (default 1s, was 3s) + # Backend has retry logic, so we don't need to wait long + delay_seconds = float(os.getenv("INITIAL_PROMPT_DELAY_SECONDS", "1")) + logger.info(f"Waiting {delay_seconds}s before auto-executing INITIAL_PROMPT (allow Service DNS to propagate)...") + await asyncio.sleep(delay_seconds) logger.info("Auto-executing INITIAL_PROMPT via backend POST...") diff --git a/components/runners/state-sync/Dockerfile b/components/runners/state-sync/Dockerfile new file mode 100644 index 000000000..b0214ff6a --- /dev/null +++ b/components/runners/state-sync/Dockerfile @@ -0,0 +1,21 @@ +FROM alpine:3.19 + +# Install rclone, git, and utilities +RUN apk add --no-cache \ + rclone \ + git \ + bash \ + curl \ + jq \ + ca-certificates + +# Copy scripts +COPY hydrate.sh /usr/local/bin/hydrate.sh +COPY sync.sh /usr/local/bin/sync.sh + +# Make scripts executable +RUN chmod +x /usr/local/bin/hydrate.sh /usr/local/bin/sync.sh + +# Default to sync.sh (used by sidecar) +ENTRYPOINT ["/usr/local/bin/sync.sh"] + diff --git a/components/runners/state-sync/hydrate.sh b/components/runners/state-sync/hydrate.sh new file mode 100644 index 000000000..165f198c4 --- /dev/null +++ b/components/runners/state-sync/hydrate.sh @@ -0,0 +1,232 @@ +#!/bin/bash +# hydrate.sh - Init container script to download session state from S3 + +set -e + +# Configuration from environment +S3_ENDPOINT="${S3_ENDPOINT:-http://minio.ambient-code.svc:9000}" +S3_BUCKET="${S3_BUCKET:-ambient-sessions}" +NAMESPACE="${NAMESPACE:-default}" +SESSION_NAME="${SESSION_NAME:-unknown}" + +# Sanitize inputs to prevent path traversal +NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}" +SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}" + +# Paths to sync (must match sync.sh) +SYNC_PATHS=( + ".claude" + "artifacts" + "file-uploads" +) + +# Error handler +error_exit() { + echo "ERROR: $1" >&2 + exit 1 +} + +# Configure rclone for S3 +setup_rclone() { + # Use explicit /tmp path since HOME may not be set in container + mkdir -p /tmp/.config/rclone || error_exit "Failed to create rclone config directory" + cat > /tmp/.config/rclone/rclone.conf << EOF +[s3] +type = s3 +provider = Other 
+access_key_id = ${AWS_ACCESS_KEY_ID} +secret_access_key = ${AWS_SECRET_ACCESS_KEY} +endpoint = ${S3_ENDPOINT} +acl = private +EOF + if [ $? -ne 0 ]; then + error_exit "Failed to write rclone configuration" + fi + # Protect config file with credentials + chmod 600 /tmp/.config/rclone/rclone.conf || error_exit "Failed to secure rclone config" +} + +echo "=========================================" +echo "Ambient Code Session State Hydration" +echo "=========================================" +echo "Session: ${NAMESPACE}/${SESSION_NAME}" +echo "S3 Endpoint: ${S3_ENDPOINT}" +echo "S3 Bucket: ${S3_BUCKET}" +echo "=========================================" + +# Create workspace structure +echo "Creating workspace structure..." +mkdir -p /workspace/.claude || error_exit "Failed to create .claude directory" +mkdir -p /workspace/artifacts || error_exit "Failed to create artifacts directory" +mkdir -p /workspace/file-uploads || error_exit "Failed to create file-uploads directory" +mkdir -p /workspace/repos || error_exit "Failed to create repos directory" + +# Set permissions on created directories (not root workspace which may be owned by different user) +# Use 755 instead of 777 - readable by all, writable only by owner +chmod 755 /workspace/.claude /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true + +# Check if S3 is configured +if [ -z "${S3_ENDPOINT}" ] || [ -z "${S3_BUCKET}" ] || [ -z "${AWS_ACCESS_KEY_ID}" ] || [ -z "${AWS_SECRET_ACCESS_KEY}" ]; then + echo "S3 not configured - using ephemeral storage only (no state persistence)" + echo "=========================================" + exit 0 +fi + +# Setup rclone +echo "Setting up rclone..." +setup_rclone + +S3_PATH="s3:${S3_BUCKET}/${NAMESPACE}/${SESSION_NAME}" + +# Test S3 connection +echo "Testing S3 connection..." +if ! rclone --config /tmp/.config/rclone/rclone.conf lsd "s3:${S3_BUCKET}/" --max-depth 1 2>&1; then + error_exit "Failed to connect to S3 at ${S3_ENDPOINT}. Check endpoint and credentials." +fi +echo "S3 connection successful" + +# Check if session state exists in S3 +echo "Checking for existing session state in S3..." +if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/" 2>/dev/null | grep -q .; then + echo "Found existing session state, downloading from S3..." + + # Download each sync path if it exists + for path in "${SYNC_PATHS[@]}"; do + if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/${path}/" 2>/dev/null | grep -q .; then + echo " Downloading ${path}/..." + rclone --config /tmp/.config/rclone/rclone.conf copy "${S3_PATH}/${path}/" "/workspace/${path}/" \ + --copy-links \ + --transfers 8 \ + --fast-list \ + --progress 2>&1 || echo " Warning: failed to download ${path}" + else + echo " No data for ${path}/" + fi + done + + echo "State hydration complete!" +else + echo "No existing state found, starting fresh session" +fi + +# Set permissions on subdirectories (EmptyDir root may not be chmodable) +echo "Setting permissions on subdirectories..." +chmod -R 755 /workspace/.claude /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true + +# ======================================== +# Clone repositories and workflows +# ======================================== +echo "=========================================" +echo "Setting up repositories and workflows..." 
+echo "=========================================" + +# Disable errexit for git clones (failures are non-fatal for private repos without auth) +set +e + +# Set HOME for git config (alpine doesn't set it by default) +export HOME=/tmp + +# Git identity +GIT_USER_NAME="${GIT_USER_NAME:-Ambient Code Bot}" +GIT_USER_EMAIL="${GIT_USER_EMAIL:-bot@ambient-code.local}" +git config --global user.name "$GIT_USER_NAME" || echo "Warning: failed to set git user.name" +git config --global user.email "$GIT_USER_EMAIL" || echo "Warning: failed to set git user.email" + +# Mark workspace as safe (in case runner needs it) +git config --global --add safe.directory /workspace 2>/dev/null || true + +# Clone repos from REPOS_JSON +if [ -n "$REPOS_JSON" ] && [ "$REPOS_JSON" != "null" ] && [ "$REPOS_JSON" != "" ]; then + echo "Cloning repositories from spec..." + # Parse JSON array and clone each repo + REPO_COUNT=$(echo "$REPOS_JSON" | jq -e 'if type == "array" then length else 0 end' 2>/dev/null || echo "0") + echo "Found $REPO_COUNT repositories to clone" + if [ "$REPO_COUNT" -gt 0 ]; then + i=0 + while [ $i -lt $REPO_COUNT ]; do + REPO_URL=$(echo "$REPOS_JSON" | jq -r ".[$i].url // empty" 2>/dev/null || echo "") + REPO_BRANCH=$(echo "$REPOS_JSON" | jq -r ".[$i].branch // \"main\"" 2>/dev/null || echo "main") + + # Derive repo name from URL + REPO_NAME=$(basename "$REPO_URL" .git 2>/dev/null || echo "") + + if [ -n "$REPO_NAME" ] && [ -n "$REPO_URL" ] && [ "$REPO_URL" != "null" ]; then + REPO_DIR="/workspace/repos/$REPO_NAME" + echo " Cloning $REPO_NAME (branch: $REPO_BRANCH)..." + + # Mark repo directory as safe + git config --global --add safe.directory "$REPO_DIR" 2>/dev/null || true + + # Clone repository (for private repos, runner will handle token injection) + if git clone --branch "$REPO_BRANCH" --single-branch "$REPO_URL" "$REPO_DIR" 2>&1; then + echo " ✓ Cloned $REPO_NAME" + else + echo " ⚠ Failed to clone $REPO_NAME (may require authentication)" + fi + fi + i=$((i + 1)) + done + fi +else + echo "No repositories configured in spec" +fi + +# Clone workflow repository +if [ -n "$ACTIVE_WORKFLOW_GIT_URL" ] && [ "$ACTIVE_WORKFLOW_GIT_URL" != "null" ]; then + WORKFLOW_BRANCH="${ACTIVE_WORKFLOW_BRANCH:-main}" + WORKFLOW_PATH="${ACTIVE_WORKFLOW_PATH:-}" + + echo "Cloning workflow repository..." + echo " URL: $ACTIVE_WORKFLOW_GIT_URL" + echo " Branch: $WORKFLOW_BRANCH" + if [ -n "$WORKFLOW_PATH" ]; then + echo " Subpath: $WORKFLOW_PATH" + fi + + # Derive workflow name from URL + WORKFLOW_NAME=$(basename "$ACTIVE_WORKFLOW_GIT_URL" .git) + WORKFLOW_FINAL="/workspace/workflows/${WORKFLOW_NAME}" + WORKFLOW_TEMP="/tmp/workflow-clone-$$" + + git config --global --add safe.directory "$WORKFLOW_FINAL" 2>/dev/null || true + + # Clone to temp location + if git clone --branch "$WORKFLOW_BRANCH" --single-branch "$ACTIVE_WORKFLOW_GIT_URL" "$WORKFLOW_TEMP" 2>&1; then + echo " Clone successful, processing..." 
+ + # Extract subpath if specified + if [ -n "$WORKFLOW_PATH" ]; then + SUBPATH_FULL="$WORKFLOW_TEMP/$WORKFLOW_PATH" + echo " Checking for subpath: $SUBPATH_FULL" + ls -la "$SUBPATH_FULL" 2>&1 || echo " Subpath does not exist" + + if [ -d "$SUBPATH_FULL" ]; then + echo " Extracting subpath: $WORKFLOW_PATH" + mkdir -p "$(dirname "$WORKFLOW_FINAL")" + cp -r "$SUBPATH_FULL" "$WORKFLOW_FINAL" + rm -rf "$WORKFLOW_TEMP" + echo " ✓ Workflow extracted from subpath to /workspace/workflows/${WORKFLOW_NAME}" + else + echo " ⚠ Subpath '$WORKFLOW_PATH' not found in cloned repo" + echo " Available paths in repo:" + find "$WORKFLOW_TEMP" -maxdepth 3 -type d | head -10 + echo " Using entire repo instead" + mv "$WORKFLOW_TEMP" "$WORKFLOW_FINAL" + echo " ✓ Workflow ready at /workspace/workflows/${WORKFLOW_NAME}" + fi + else + # No subpath - use entire repo + mv "$WORKFLOW_TEMP" "$WORKFLOW_FINAL" + echo " ✓ Workflow ready at /workspace/workflows/${WORKFLOW_NAME}" + fi + else + echo " ⚠ Failed to clone workflow" + rm -rf "$WORKFLOW_TEMP" 2>/dev/null || true + fi +fi + +echo "=========================================" +echo "Workspace initialized successfully" +echo "=========================================" +exit 0 + diff --git a/components/runners/state-sync/sync.sh b/components/runners/state-sync/sync.sh new file mode 100644 index 000000000..05498ac5f --- /dev/null +++ b/components/runners/state-sync/sync.sh @@ -0,0 +1,156 @@ +#!/bin/bash +# sync.sh - Sidecar script to sync session state to S3 every N seconds + +set -e + +# Configuration from environment +S3_ENDPOINT="${S3_ENDPOINT:-http://minio.ambient-code.svc:9000}" +S3_BUCKET="${S3_BUCKET:-ambient-sessions}" +NAMESPACE="${NAMESPACE:-default}" +SESSION_NAME="${SESSION_NAME:-unknown}" +SYNC_INTERVAL="${SYNC_INTERVAL:-60}" +MAX_SYNC_SIZE="${MAX_SYNC_SIZE:-1073741824}" # 1GB default + +# Sanitize inputs to prevent path traversal +NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}" +SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}" + +# Paths to sync (non-git content) +SYNC_PATHS=( + ".claude" + "artifacts" + "file-uploads" +) + +# Patterns to exclude from sync +EXCLUDE_PATTERNS=( + "repos/**" # Git handles this + "node_modules/**" + ".venv/**" + "__pycache__/**" + ".cache/**" + "*.pyc" + "target/**" + "dist/**" + "build/**" + ".git/**" + ".claude/debug/**" # Debug logs with symlinks that break rclone +) + +# Configure rclone for S3 +setup_rclone() { + # Use explicit /tmp path since HOME may not be set in container + mkdir -p /tmp/.config/rclone + cat > /tmp/.config/rclone/rclone.conf << EOF +[s3] +type = s3 +provider = Other +access_key_id = ${AWS_ACCESS_KEY_ID} +secret_access_key = ${AWS_SECRET_ACCESS_KEY} +endpoint = ${S3_ENDPOINT} +acl = private +EOF + # Protect config file with credentials + chmod 600 /tmp/.config/rclone/rclone.conf +} + +# Check total size before sync +check_size() { + local total=0 + for path in "${SYNC_PATHS[@]}"; do + if [ -d "/workspace/${path}" ]; then + size=$(du -sb "/workspace/${path}" 2>/dev/null | cut -f1 || echo 0) + total=$((total + size)) + fi + done + + if [ $total -gt $MAX_SYNC_SIZE ]; then + echo "WARNING: Sync size (${total} bytes) exceeds limit (${MAX_SYNC_SIZE} bytes)" + echo "Some files may be skipped" + return 1 + fi + return 0 +} + +# Sync workspace state to S3 +sync_to_s3() { + local s3_path="s3:${S3_BUCKET}/${NAMESPACE}/${SESSION_NAME}" + + echo "[$(date -Iseconds)] Starting sync to S3..." + + local synced=0 + for path in "${SYNC_PATHS[@]}"; do + if [ -d "/workspace/${path}" ]; then + echo " Syncing ${path}/..." 
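+            # The printf expansion below turns EXCLUDE_PATTERNS into repeated flags,
+            # roughly: --exclude repos/** --exclude node_modules/** ... (illustrative).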
+ if rclone --config /tmp/.config/rclone/rclone.conf sync "/workspace/${path}" "${s3_path}/${path}/" \ + --checksum \ + --copy-links \ + --transfers 4 \ + --fast-list \ + --stats-one-line \ + --max-size ${MAX_SYNC_SIZE} \ + $(printf -- '--exclude %s ' "${EXCLUDE_PATTERNS[@]}") \ + 2>&1; then + synced=$((synced + 1)) + else + echo " Warning: sync of ${path} had errors" + fi + fi + done + + # Save metadata + echo "{\"lastSync\": \"$(date -Iseconds)\", \"session\": \"${SESSION_NAME}\", \"namespace\": \"${NAMESPACE}\", \"pathsSynced\": ${synced}}" > /tmp/metadata.json + rclone --config /tmp/.config/rclone/rclone.conf copy /tmp/metadata.json "${s3_path}/" 2>&1 || true + + echo "[$(date -Iseconds)] Sync complete (${synced} paths synced)" +} + +# Final sync on shutdown +final_sync() { + echo "" + echo "=========================================" + echo "[$(date -Iseconds)] SIGTERM received, performing final sync..." + echo "=========================================" + sync_to_s3 + echo "=========================================" + echo "[$(date -Iseconds)] Final sync complete, exiting" + echo "=========================================" + exit 0 +} + +# Main +echo "=========================================" +echo "Ambient Code State Sync Sidecar" +echo "=========================================" +echo "Session: ${NAMESPACE}/${SESSION_NAME}" +echo "S3 Endpoint: ${S3_ENDPOINT}" +echo "S3 Bucket: ${S3_BUCKET}" +echo "Sync interval: ${SYNC_INTERVAL}s" +echo "Max sync size: ${MAX_SYNC_SIZE} bytes" +echo "=========================================" + +# Check if S3 is configured +if [ -z "${S3_ENDPOINT}" ] || [ -z "${S3_BUCKET}" ] || [ -z "${AWS_ACCESS_KEY_ID}" ] || [ -z "${AWS_SECRET_ACCESS_KEY}" ]; then + echo "S3 not configured - state sync disabled (ephemeral storage only)" + echo "Session will not persist across pod restarts" + echo "=========================================" + # Sleep forever - keep sidecar alive but do nothing + while true; do + sleep 3600 + done +fi + +setup_rclone +trap 'final_sync' SIGTERM SIGINT + +# Initial delay to let workspace populate +echo "Waiting 30s for workspace to populate..." +sleep 30 + +# Main sync loop +while true; do + check_size || echo "Size check warning (continuing anyway)" + sync_to_s3 || echo "Sync failed, will retry in ${SYNC_INTERVAL}s..." + sleep ${SYNC_INTERVAL} +done + diff --git a/docs/minio-quickstart.md b/docs/minio-quickstart.md new file mode 100644 index 000000000..26fbe2a5c --- /dev/null +++ b/docs/minio-quickstart.md @@ -0,0 +1,297 @@ +# MinIO Quickstart for Ambient Code + +## Overview + +MinIO provides in-cluster S3-compatible storage for Ambient Code session state, artifacts, and uploads. This guide shows you how to deploy and configure MinIO. + +## Quick Setup + +### 1. Deploy MinIO + +```bash +# Create MinIO credentials secret +cd components/manifests/base +cp minio-credentials-secret.yaml.example minio-credentials-secret.yaml + +# Edit credentials (change admin/changeme123 to secure values) +vi minio-credentials-secret.yaml + +# Apply the secret +kubectl apply -f minio-credentials-secret.yaml -n ambient-code + +# MinIO deployment is included in base manifests, so deploy normally +make deploy NAMESPACE=ambient-code +``` + +### 2. Create Bucket + +```bash +# Run automated setup +make setup-minio NAMESPACE=ambient-code + +# Or manually: +kubectl port-forward svc/minio 9001:9001 -n ambient-code & +open http://localhost:9001 +# Login with credentials, create bucket "ambient-sessions" +``` + +### 3. 
Configure Project + +Navigate to project settings in the UI and configure: + +| Field | Value | +|-------|-------| +| **Enable S3 Storage** | ✅ Checked | +| **S3_ENDPOINT** | `http://minio.ambient-code.svc:9000` | +| **S3_BUCKET** | `ambient-sessions` | +| **S3_REGION** | `us-east-1` (not used by MinIO but required field) | +| **S3_ACCESS_KEY** | Your MinIO root user | +| **S3_SECRET_KEY** | Your MinIO root password | + +Click **Save Integration Secrets**. + +## Accessing MinIO Console + +### Option 1: Port Forward + +```bash +make minio-console NAMESPACE=ambient-code +# Opens at http://localhost:9001 +``` + +### Option 2: Create Route (OpenShift) + +```bash +oc create route edge minio-console \ + --service=minio \ + --port=9001 \ + -n ambient-code + +# Get URL +oc get route minio-console -n ambient-code -o jsonpath='{.spec.host}' +``` + +## Viewing Session Artifacts + +### Via MinIO Console + +1. Open MinIO console: `make minio-console` +2. Navigate to "Buckets" → "ambient-sessions" +3. Browse: `{namespace}/{session-name}/` + - `.claude/` - Session history + - `artifacts/` - Generated files + - `uploads/` - User uploads + +### Via MinIO Client (mc) + +```bash +# Install mc +brew install minio/stable/mc + +# Configure alias +kubectl port-forward svc/minio 9000:9000 -n ambient-code & +mc alias set ambient http://localhost:9000 admin changeme123 + +# List sessions +mc ls ambient/ambient-sessions/ + +# List session artifacts +mc ls ambient/ambient-sessions/my-project/session-abc/artifacts/ + +# Download artifacts +mc cp --recursive ambient/ambient-sessions/my-project/session-abc/artifacts/ ./local-dir/ + +# Download session history +mc cp --recursive ambient/ambient-sessions/my-project/session-abc/.claude/ ./.claude/ +``` + +### Via kubectl exec + +```bash +# Get MinIO pod +MINIO_POD=$(kubectl get pod -l app=minio -n ambient-code -o jsonpath='{.items[0].metadata.name}') + +# List sessions +kubectl exec -n ambient-code "${MINIO_POD}" -- mc ls local/ambient-sessions/ + +# Download file +kubectl exec -n ambient-code "${MINIO_POD}" -- mc cp "local/ambient-sessions/my-project/session-abc/artifacts/report.pdf" /tmp/ +kubectl cp "ambient-code/${MINIO_POD}:/tmp/report.pdf" ./report.pdf +``` + +## Management Commands + +```bash +# Check MinIO status +make minio-status NAMESPACE=ambient-code + +# View MinIO logs +make minio-logs NAMESPACE=ambient-code + +# Port forward to MinIO API (for mc commands) +kubectl port-forward svc/minio 9000:9000 -n ambient-code +``` + +## Bucket Lifecycle Management + +### Set Auto-Delete Policy + +Keep storage costs down by auto-deleting old sessions: + +```bash +# Create lifecycle policy +cat > /tmp/lifecycle.json << 'EOF' +{ + "Rules": [ + { + "ID": "expire-old-sessions", + "Status": "Enabled", + "Expiration": { + "Days": 30 + } + } + ] +} +EOF + +# Apply policy +kubectl exec -n ambient-code "${MINIO_POD}" -- mc ilm import "local/ambient-sessions" /tmp/lifecycle.json +``` + +### Monitor Storage Usage + +```bash +# Check bucket size +kubectl exec -n ambient-code "${MINIO_POD}" -- mc du local/ambient-sessions + +# List largest sessions +kubectl exec -n ambient-code "${MINIO_POD}" -- mc du --depth 2 local/ambient-sessions | sort -n -r | head -10 +``` + +## Backup and Restore + +### Backup MinIO Data + +```bash +# Backup to local directory +kubectl exec -n ambient-code "${MINIO_POD}" -- mc mirror local/ambient-sessions /tmp/backup/ +kubectl cp "ambient-code/${MINIO_POD}:/tmp/backup" ./minio-backup/ + +# Or use external mc client +mc mirror ambient/ambient-sessions 
./minio-backup/ +``` + +### Restore from Backup + +```bash +# Copy backup to pod +kubectl cp ./minio-backup/ "ambient-code/${MINIO_POD}:/tmp/restore" + +# Restore +kubectl exec -n ambient-code "${MINIO_POD}" -- mc mirror /tmp/restore local/ambient-sessions +``` + +## Troubleshooting + +### MinIO Pod Not Starting + +```bash +# Check events +kubectl get events -n ambient-code --sort-by='.lastTimestamp' | grep minio + +# Check PVC +kubectl get pvc minio-data -n ambient-code + +# Check pod logs +kubectl logs -f deployment/minio -n ambient-code +``` + +### Can't Access MinIO Console + +```bash +# Check service +kubectl get svc minio -n ambient-code + +# Test connection from within cluster +kubectl run -it --rm debug --image=curlimages/curl --restart=Never -n ambient-code -- \ + curl -v http://minio.ambient-code.svc:9000/minio/health/live +``` + +### Session Init Failing + +```bash +# Check session pod init container logs +kubectl logs {session-pod} -c init-hydrate -n {namespace} + +# Common issues: +# - Wrong S3 endpoint (check project settings) +# - Bucket doesn't exist (create in MinIO console) +# - Wrong credentials (verify in project settings) +``` + +## Production Considerations + +### High Availability + +For production, deploy MinIO in distributed mode: + +```bash +# Use MinIO Operator +kubectl apply -k "github.com/minio/operator" +kubectl apply -f - </dev/null 2>&1; then + MINIO_USER=$(kubectl get secret minio-credentials -n "${NAMESPACE}" -o jsonpath='{.data.root-user}' | base64 -d) + MINIO_PASSWORD=$(kubectl get secret minio-credentials -n "${NAMESPACE}" -o jsonpath='{.data.root-password}' | base64 -d) +else + echo "ERROR: minio-credentials secret not found in namespace ${NAMESPACE}" + echo "Please create it first:" + echo " 1. Copy components/manifests/base/minio-credentials-secret.yaml.example to minio-credentials-secret.yaml" + echo " 2. Edit with secure credentials" + echo " 3. kubectl apply -f minio-credentials-secret.yaml -n ${NAMESPACE}" + exit 1 +fi + +echo "=========================================" +echo "MinIO Setup for Ambient Code Platform" +echo "=========================================" +echo "Namespace: ${NAMESPACE}" +echo "Bucket: ${BUCKET_NAME}" +echo "=========================================" + +# Check if MinIO is deployed +echo "Checking MinIO deployment..." +if ! kubectl get deployment minio -n "${NAMESPACE}" >/dev/null 2>&1; then + echo "Error: MinIO deployment not found in namespace ${NAMESPACE}" + echo "Deploy MinIO first: kubectl apply -f components/manifests/base/minio-deployment.yaml" + exit 1 +fi + +# Wait for MinIO to be ready +echo "Waiting for MinIO to be ready..." +kubectl wait --for=condition=ready pod -l app=minio -n "${NAMESPACE}" --timeout=120s + +# Get MinIO pod name +MINIO_POD=$(kubectl get pod -l app=minio -n "${NAMESPACE}" -o jsonpath='{.items[0].metadata.name}') +echo "MinIO pod: ${MINIO_POD}" + +# Set up MinIO client alias +echo "Configuring MinIO client..." +kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc alias set local http://localhost:9000 "${MINIO_USER}" "${MINIO_PASSWORD}" + +# Create bucket if it doesn't exist +echo "Creating bucket: ${BUCKET_NAME}..." +if kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc ls "local/${BUCKET_NAME}" >/dev/null 2>&1; then + echo "Bucket ${BUCKET_NAME} already exists" +else + kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc mb "local/${BUCKET_NAME}" + echo "Created bucket: ${BUCKET_NAME}" +fi + +# Set bucket to private (default) +echo "Setting bucket policy..." 
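+# "mc anonymous set none" removes any anonymous-access policy, so the bucket is only
+# reachable with the credentials above (new MinIO buckets are private by default).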
+kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc anonymous set none "local/${BUCKET_NAME}" + +# Enable versioning (optional - helps with recovery) +echo "Enabling versioning..." +kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc version enable "local/${BUCKET_NAME}" + +# Show bucket info +echo "" +echo "=========================================" +echo "MinIO Setup Complete!" +echo "=========================================" +echo "Bucket: ${BUCKET_NAME}" +echo "Endpoint: http://minio.${NAMESPACE}.svc:9000" +echo "" +echo "MinIO Console Access:" +echo " kubectl port-forward svc/minio 9001:9001 -n ${NAMESPACE}" +echo " Then open: http://localhost:9001" +echo " Login: ${MINIO_USER} / ${MINIO_PASSWORD}" +echo "" +echo "Configure in Project Settings:" +echo " S3_ENDPOINT: http://minio.${NAMESPACE}.svc:9000" +echo " S3_BUCKET: ${BUCKET_NAME}" +echo " S3_ACCESS_KEY: ${MINIO_USER}" +echo " S3_SECRET_KEY: ${MINIO_PASSWORD}" +echo "=========================================" + From 7be7e66b34ef6b604aa0b28ef47804881e17212a Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 17:21:35 -0600 Subject: [PATCH 2/6] feat: Enhance repository management and session handling - Implemented runtime cloning of repositories when added to a session, improving user experience by allowing immediate access to code. - Updated session handling to derive repository names from URLs, ensuring consistency in naming conventions. - Added user authentication and authorization validation for session-related API endpoints, enhancing security. - Improved frontend session detail page to conditionally display options and menus based on session status, streamlining user interaction. - Refactored backend code to remove legacy watch-based implementations, transitioning to a more efficient controller-runtime based approach for session management. 
--- components/backend/handlers/sessions.go | 73 ++++- .../[name]/sessions/[sessionName]/page.tsx | 48 +-- .../operator/internal/handlers/reconciler.go | 55 ---- .../operator/internal/handlers/sessions.go | 70 +--- components/operator/main.go | 31 +- .../runners/claude-code-runner/adapter.py | 22 +- components/runners/claude-code-runner/main.py | 299 +++++++++++++++++- components/runners/state-sync/hydrate.sh | 24 +- components/runners/state-sync/sync.sh | 31 +- 9 files changed, 459 insertions(+), 194 deletions(-) diff --git a/components/backend/handlers/sessions.go b/components/backend/handlers/sessions.go index 591213de8..6af2681a8 100644 --- a/components/backend/handlers/sessions.go +++ b/components/backend/handlers/sessions.go @@ -2,6 +2,7 @@ package handlers import ( + "bytes" "context" "encoding/base64" "encoding/json" @@ -1276,6 +1277,52 @@ func AddRepo(c *gin.Context) { return } + // Derive repo name from URL + repoName := req.URL + if idx := strings.LastIndex(req.URL, "/"); idx != -1 { + repoName = req.URL[idx+1:] + } + repoName = strings.TrimSuffix(repoName, ".git") + + // Call runner to clone the repository (if session is running) + status, _ := item.Object["status"].(map[string]interface{}) + phase, _ := status["phase"].(string) + if phase == "Running" { + runnerURL := fmt.Sprintf("http://session-%s.%s.svc.cluster.local:8001/repos/add", sessionName, project) + runnerReq := map[string]string{ + "url": req.URL, + "branch": req.Branch, + "name": repoName, + } + reqBody, _ := json.Marshal(runnerReq) + + log.Printf("Calling runner to clone repo: %s -> %s", req.URL, runnerURL) + httpReq, err := http.NewRequestWithContext(c.Request.Context(), "POST", runnerURL, bytes.NewReader(reqBody)) + if err != nil { + log.Printf("Failed to create runner request: %v", err) + c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to create runner request"}) + return + } + httpReq.Header.Set("Content-Type", "application/json") + + client := &http.Client{Timeout: 120 * time.Second} // Allow time for clone + resp, err := client.Do(httpReq) + if err != nil { + log.Printf("Failed to call runner to clone repo: %v", err) + c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to clone repository (runner not reachable)"}) + return + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + body, _ := io.ReadAll(resp.Body) + log.Printf("Runner failed to clone repo (status %d): %s", resp.StatusCode, string(body)) + c.JSON(resp.StatusCode, gin.H{"error": fmt.Sprintf("Failed to clone repository: %s", string(body))}) + return + } + log.Printf("Runner successfully cloned repo %s for session %s", repoName, sessionName) + } + // Update spec.repos spec, ok := item.Object["spec"].(map[string]interface{}) if !ok { @@ -1315,7 +1362,7 @@ func AddRepo(c *gin.Context) { } log.Printf("Added repository %s to session %s in project %s", req.URL, sessionName, project) - c.JSON(http.StatusOK, gin.H{"message": "Repository added", "session": session}) + c.JSON(http.StatusOK, gin.H{"message": "Repository added", "name": repoName, "session": session}) } // RemoveRepo removes a repository from a running session @@ -1420,6 +1467,14 @@ func GetWorkflowMetadata(c *gin.Context) { return } + // Validate user authentication and authorization + reqK8s, _ := GetK8sClientsForRequest(c) + if reqK8s == nil { + c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) + c.Abort() + return + } + // Get authorization token token := c.GetHeader("Authorization") if strings.TrimSpace(token) == "" { @@ 
-2209,6 +2264,14 @@ func ListSessionWorkspace(c *gin.Context) { return } + // Validate user authentication and authorization + reqK8s, _ := GetK8sClientsForRequest(c) + if reqK8s == nil { + c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) + c.Abort() + return + } + rel := strings.TrimSpace(c.Query("path")) // Path is relative to content service's StateBaseDir (which is /workspace) // Content service handles the base path, so we just pass the relative path @@ -2285,6 +2348,14 @@ func GetSessionWorkspaceFile(c *gin.Context) { return } + // Validate user authentication and authorization + reqK8s, _ := GetK8sClientsForRequest(c) + if reqK8s == nil { + c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) + c.Abort() + return + } + sub := strings.TrimPrefix(c.Param("path"), "/") // Path is relative to content service's StateBaseDir (which is /workspace) absPath := sub diff --git a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx index da3c28e76..987230788 100644 --- a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx +++ b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx @@ -359,13 +359,15 @@ export default function ProjectSessionDetailPage({ if (data.name && data.inputRepo) { try { + // Repos are cloned to /workspace/repos/{name} + const repoPath = `repos/${data.name}`; await fetch( `/api/projects/${projectName}/agentic-sessions/${sessionName}/git/configure-remote`, { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ - path: data.name, + path: repoPath, remoteUrl: data.inputRepo.url, branch: data.inputRepo.branch || "main", }), @@ -373,7 +375,7 @@ export default function ProjectSessionDetailPage({ ); const newRemotes = { ...directoryRemotes }; - newRemotes[data.name] = { + newRemotes[repoPath] = { url: data.inputRepo.url, branch: data.inputRepo.branch || "main", }; @@ -1382,36 +1384,39 @@ export default function ProjectSessionDetailPage({
- {/* Mobile: Options menu button (below header border) */} -
- -
+ {/* Mobile: Options menu button (below header border) - only show when session is running */} + {session?.status?.phase === "Running" && ( +
+ +
+ )} {/* Main content area */}
- {/* Mobile sidebar overlay */} - {mobileMenuOpen && ( + {/* Mobile sidebar overlay - only show when session is running */} + {session?.status?.phase === "Running" && mobileMenuOpen && (
setMobileMenuOpen(false)} /> )} - {/* Left Column - Accordions */} -
+ {/* Left Column - Accordions - only show when session is running */} + {session?.status?.phase === "Running" && ( +
{/* Mobile close button */}
+ )} {/* Right Column - Messages */}
diff --git a/components/operator/internal/handlers/reconciler.go b/components/operator/internal/handlers/reconciler.go index de10f5b42..f7982932e 100644 --- a/components/operator/internal/handlers/reconciler.go +++ b/components/operator/internal/handlers/reconciler.go @@ -393,58 +393,3 @@ func collectPodErrorMessage(pod *corev1.Pod) string { return errorMsg } - -// WatchAgenticSessionsLegacy is the original watch-based implementation. -// This is kept for backward compatibility during migration. -// DEPRECATED: Use controller-runtime based reconciliation instead. -func WatchAgenticSessionsLegacy() { - gvr := types.GetAgenticSessionResource() - - for { - // Watch AgenticSessions across all namespaces - watcher, err := config.DynamicClient.Resource(gvr).Watch(context.TODO(), v1.ListOptions{}) - if err != nil { - log.Printf("Failed to create AgenticSession watcher: %v", err) - time.Sleep(5 * time.Second) - continue - } - - log.Println("Watching for AgenticSession events across all namespaces...") - - for event := range watcher.ResultChan() { - // Reduced logging - only log errors and key events - switch event.Type { - case "ADDED", "MODIFIED": - obj := event.Object.(*unstructured.Unstructured) - - // Only process resources in managed namespaces - ns := obj.GetNamespace() - if ns == "" { - continue - } - nsObj, err := config.K8sClient.CoreV1().Namespaces().Get(context.TODO(), ns, v1.GetOptions{}) - if err != nil { - continue - } - if nsObj.Labels["ambient-code.io/managed"] != "true" { - continue - } - - // Remove the 100ms delay - controller-runtime handles debouncing - if err := handleAgenticSessionEvent(obj); err != nil { - log.Printf("Error handling AgenticSession event: %v", err) - } - case "DELETED": - obj := event.Object.(*unstructured.Unstructured) - log.Printf("AgenticSession %s/%s deleted", obj.GetNamespace(), obj.GetName()) - case "ERROR": - obj := event.Object.(*unstructured.Unstructured) - log.Printf("Watch error for AgenticSession: %v", obj) - } - } - - log.Println("AgenticSession watch channel closed, restarting...") - watcher.Stop() - time.Sleep(2 * time.Second) - } -} diff --git a/components/operator/internal/handlers/sessions.go b/components/operator/internal/handlers/sessions.go index a7d412a2d..867a09e68 100644 --- a/components/operator/internal/handlers/sessions.go +++ b/components/operator/internal/handlers/sessions.go @@ -30,73 +30,15 @@ import ( ) // Track which pods are currently being monitored to prevent duplicate goroutines +// NOTE: This is used by the legacy handleAgenticSessionEvent function which is +// kept for reference but no longer actively called by the operator. +// The controller-runtime based reconciler in internal/controller/ handles all +// AgenticSession reconciliation now. 
var ( monitoredPods = make(map[string]bool) monitoredPodsMu sync.Mutex ) -// WatchAgenticSessions watches for AgenticSession custom resources and creates pods -func WatchAgenticSessions() { - gvr := types.GetAgenticSessionResource() - - for { - // Watch AgenticSessions across all namespaces - watcher, err := config.DynamicClient.Resource(gvr).Watch(context.TODO(), v1.ListOptions{}) - if err != nil { - log.Printf("Failed to create AgenticSession watcher: %v", err) - time.Sleep(5 * time.Second) - continue - } - - log.Println("Watching for AgenticSession events across all namespaces...") - - for event := range watcher.ResultChan() { - switch event.Type { - case watch.Added, watch.Modified: - obj := event.Object.(*unstructured.Unstructured) - - // Only process resources in managed namespaces - ns := obj.GetNamespace() - if ns == "" { - continue - } - nsObj, err := config.K8sClient.CoreV1().Namespaces().Get(context.TODO(), ns, v1.GetOptions{}) - if err != nil { - log.Printf("Failed to get namespace %s: %v", ns, err) - continue - } - if nsObj.Labels["ambient-code.io/managed"] != "true" { - // Skip unmanaged namespaces - continue - } - - // Add small delay to avoid race conditions with rapid create/delete cycles - time.Sleep(100 * time.Millisecond) - - if err := handleAgenticSessionEvent(obj); err != nil { - log.Printf("Error handling AgenticSession event: %v", err) - } - case watch.Deleted: - obj := event.Object.(*unstructured.Unstructured) - sessionName := obj.GetName() - sessionNamespace := obj.GetNamespace() - log.Printf("AgenticSession %s/%s deleted", sessionNamespace, sessionName) - - // Cancel any ongoing job monitoring for this session - // (We could implement this with a context cancellation if needed) - // OwnerReferences handle cleanup of per-session resources - case watch.Error: - obj := event.Object.(*unstructured.Unstructured) - log.Printf("Watch error for AgenticSession: %v", obj) - } - } - - log.Println("AgenticSession watch channel closed, restarting...") - watcher.Stop() - time.Sleep(2 * time.Second) - } -} - func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { name := obj.GetName() sessionNamespace := obj.GetNamespace() @@ -917,6 +859,8 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }(), VolumeMounts: []corev1.VolumeMount{ {Name: "workspace", MountPath: "/workspace"}, + // SubPath mount for .claude so init container writes to same location as runner + {Name: "workspace", MountPath: "/app/.claude", SubPath: ".claude"}, }, }, }, @@ -1235,6 +1179,8 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }, VolumeMounts: []corev1.VolumeMount{ {Name: "workspace", MountPath: "/workspace", ReadOnly: false}, + // SubPath mount for .claude so sync sidecar reads from same location as runner + {Name: "workspace", MountPath: "/app/.claude", SubPath: ".claude", ReadOnly: false}, }, Resources: corev1.ResourceRequirements{ Requests: corev1.ResourceList{ diff --git a/components/operator/main.go b/components/operator/main.go index c71c12709..3eb47a231 100644 --- a/components/operator/main.go +++ b/components/operator/main.go @@ -44,7 +44,6 @@ func main() { var enableLeaderElection bool var probeAddr string var maxConcurrentReconciles int - var useLegacyWatch bool flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.") flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.") @@ -53,8 +52,6 @@ func main() { "Enabling this will ensure 
there is only one active controller manager.") flag.IntVar(&maxConcurrentReconciles, "max-concurrent-reconciles", 10, "Maximum number of concurrent Reconciles which can be run. Higher values allow more throughput but consume more resources.") - flag.BoolVar(&useLegacyWatch, "legacy-watch", false, - "Use legacy watch-based implementation instead of controller-runtime (for debugging only).") flag.Parse() // Allow environment variable override for max concurrent reconciles @@ -77,10 +74,9 @@ func main() { logger.Info("Starting Agentic Session Operator", "maxConcurrentReconciles", maxConcurrentReconciles, "leaderElection", enableLeaderElection, - "legacyWatch", useLegacyWatch, ) - // Initialize Kubernetes clients (needed for legacy handlers and config) + // Initialize Kubernetes clients (needed for namespace/projectsettings handlers and config) if err := config.InitK8sClients(); err != nil { logger.Error(err, "Failed to initialize Kubernetes clients") os.Exit(1) @@ -111,13 +107,6 @@ func main() { } } - // If legacy watch mode is requested, use the old implementation - if useLegacyWatch { - logger.Info("Using legacy watch-based implementation") - runLegacyMode() - return - } - // Create controller-runtime manager with increased QPS/Burst to avoid client-side throttling // Default is QPS=5, Burst=10 which causes delays when handling multiple sessions restConfig := ctrl.GetConfigOrDie() @@ -175,24 +164,6 @@ func main() { } } -// runLegacyMode runs the operator using the old watch-based implementation. -// This is kept for backward compatibility and debugging. -func runLegacyMode() { - log.Println("=== LEGACY MODE: Using watch-based implementation ===") - - // Start watching AgenticSession resources (legacy) - go handlers.WatchAgenticSessions() - - // Start watching for managed namespaces - go handlers.WatchNamespaces() - - // Start watching ProjectSettings resources - go handlers.WatchProjectSettings() - - // Keep the operator running - select {} -} - func logBuildInfo() { log.Println("==============================================") log.Println("Agentic Session Operator - Build Information") diff --git a/components/runners/claude-code-runner/adapter.py b/components/runners/claude-code-runner/adapter.py index ad8002f23..4b4e8c30f 100644 --- a/components/runners/claude-code-runner/adapter.py +++ b/components/runners/claude-code-runner/adapter.py @@ -790,11 +790,12 @@ def _setup_workflow_paths(self, active_workflow_url: str, repos_cfg: list) -> tu logger.warning(f"Failed to derive workflow name: {e}, using default") cwd_path = str(Path(self.context.workspace_path) / "workflows" / "default") - # Add all repos as additional directories + # Add all repos as additional directories (repos are in /workspace/repos/{name}) + repos_base = Path(self.context.workspace_path) / "repos" for r in repos_cfg: name = (r.get('name') or '').strip() if name: - repo_path = str(Path(self.context.workspace_path) / name) + repo_path = str(repos_base / name) if repo_path not in add_dirs: add_dirs.append(repo_path) @@ -810,8 +811,14 @@ def _setup_workflow_paths(self, active_workflow_url: str, repos_cfg: list) -> tu return cwd_path, add_dirs, derived_name def _setup_multi_repo_paths(self, repos_cfg: list) -> tuple[str, list]: - """Setup paths for multi-repo mode.""" + """Setup paths for multi-repo mode. 
+ + Repos are cloned to /workspace/repos/{name} by both: + - hydrate.sh (init container) + - clone_repo_at_runtime() (runtime addition) + """ add_dirs = [] + repos_base = Path(self.context.workspace_path) / "repos" main_name = (os.getenv('MAIN_REPO_NAME') or '').strip() if not main_name: @@ -824,13 +831,15 @@ def _setup_multi_repo_paths(self, repos_cfg: list) -> tuple[str, list]: idx_val = 0 main_name = (repos_cfg[idx_val].get('name') or '').strip() - cwd_path = str(Path(self.context.workspace_path) / main_name) if main_name else self.context.workspace_path + # Main repo path is /workspace/repos/{name} + cwd_path = str(repos_base / main_name) if main_name else self.context.workspace_path for r in repos_cfg: name = (r.get('name') or '').strip() if not name: continue - p = str(Path(self.context.workspace_path) / name) + # All repos are in /workspace/repos/{name} + p = str(repos_base / name) if p != cwd_path: add_dirs.append(p) @@ -1273,9 +1282,10 @@ def _build_workspace_context_prompt(self, repos_cfg, workflow_name, artifacts_pa if repos_cfg: prompt += "## Available Code Repositories\n" + prompt += "Location: repos/\n" for i, repo in enumerate(repos_cfg): name = repo.get('name', f'repo-{i}') - prompt += f"- {name}/\n" + prompt += f"- repos/{name}/\n" prompt += "\nThese repositories contain source code you can read or modify.\n\n" if ambient_config.get("systemPrompt"): diff --git a/components/runners/claude-code-runner/main.py b/components/runners/claude-code-runner/main.py index afbbefaed..412ce70bd 100644 --- a/components/runners/claude-code-runner/main.py +++ b/components/runners/claude-code-runner/main.py @@ -227,12 +227,10 @@ async def event_generator(): try: logger.info("Event generator started") - # Initialize adapter on first run (yields setup events) + # Initialize adapter on first run if not _adapter_initialized: logger.info("First run - initializing adapter with workspace preparation") - async for event in adapter.initialize(context): - logger.debug(f"Yielding initialization event: {event.type}") - yield encoder.encode(event) + await adapter.initialize(context) logger.info("Adapter initialization complete") _adapter_initialized = True @@ -288,6 +286,105 @@ async def interrupt_run(): raise HTTPException(status_code=500, detail=str(e)) +async def clone_workflow_at_runtime(git_url: str, branch: str, subpath: str) -> tuple[bool, str]: + """ + Clone a workflow repository at runtime. + + This mirrors the logic in hydrate.sh but runs when workflows are changed + after the pod has started. 
+ + Returns: + (success, workflow_dir_path) tuple + """ + import tempfile + import shutil + from pathlib import Path + + if not git_url: + return False, "" + + # Derive workflow name from URL + workflow_name = git_url.split("/")[-1].removesuffix(".git") + workspace_path = os.getenv("WORKSPACE_PATH", "/workspace") + workflow_final = Path(workspace_path) / "workflows" / workflow_name + + logger.info(f"Cloning workflow '{workflow_name}' from {git_url}@{branch}") + if subpath: + logger.info(f" Subpath: {subpath}") + + # Create temp directory for clone + temp_dir = Path(tempfile.mkdtemp(prefix="workflow-clone-")) + + try: + # Build git clone command with optional auth token + github_token = os.getenv("GITHUB_TOKEN", "").strip() + gitlab_token = os.getenv("GITLAB_TOKEN", "").strip() + + # Determine which token to use based on URL + clone_url = git_url + if github_token and "github" in git_url.lower(): + clone_url = git_url.replace("https://", f"https://x-access-token:{github_token}@") + logger.info("Using GITHUB_TOKEN for workflow authentication") + elif gitlab_token and "gitlab" in git_url.lower(): + clone_url = git_url.replace("https://", f"https://oauth2:{gitlab_token}@") + logger.info("Using GITLAB_TOKEN for workflow authentication") + + # Clone the repository + process = await asyncio.create_subprocess_exec( + "git", "clone", "--branch", branch, "--single-branch", "--depth", "1", + clone_url, str(temp_dir), + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE + ) + stdout, stderr = await process.communicate() + + if process.returncode != 0: + # Redact tokens from error message + error_msg = stderr.decode() + if github_token: + error_msg = error_msg.replace(github_token, "***REDACTED***") + if gitlab_token: + error_msg = error_msg.replace(gitlab_token, "***REDACTED***") + logger.error(f"Failed to clone workflow: {error_msg}") + return False, "" + + logger.info("Clone successful, processing...") + + # Handle subpath extraction + if subpath: + subpath_full = temp_dir / subpath + if subpath_full.exists() and subpath_full.is_dir(): + logger.info(f"Extracting subpath: {subpath}") + # Remove existing workflow dir if exists + if workflow_final.exists(): + shutil.rmtree(workflow_final) + # Create parent dirs and copy subpath + workflow_final.parent.mkdir(parents=True, exist_ok=True) + shutil.copytree(subpath_full, workflow_final) + logger.info(f"Workflow extracted to {workflow_final}") + else: + logger.warning(f"Subpath '{subpath}' not found, using entire repo") + if workflow_final.exists(): + shutil.rmtree(workflow_final) + shutil.move(str(temp_dir), str(workflow_final)) + else: + # No subpath - use entire repo + if workflow_final.exists(): + shutil.rmtree(workflow_final) + shutil.move(str(temp_dir), str(workflow_final)) + + logger.info(f"Workflow '{workflow_name}' ready at {workflow_final}") + return True, str(workflow_final) + + except Exception as e: + logger.error(f"Error cloning workflow: {e}") + return False, "" + finally: + # Cleanup temp directory if it still exists + if temp_dir.exists(): + shutil.rmtree(temp_dir, ignore_errors=True) + + @app.post("/workflow") async def change_workflow(request: Request): """ @@ -307,6 +404,13 @@ async def change_workflow(request: Request): logger.info(f"Workflow change request: {git_url}@{branch} (path: {path})") + # Clone the workflow repository at runtime + # This is needed because the init container only runs once at pod startup + if git_url: + success, workflow_path = await clone_workflow_at_runtime(git_url, branch, path) + if not 
success: + logger.warning("Failed to clone workflow, will use default workflow directory") + # Update environment variables os.environ["ACTIVE_WORKFLOW_GIT_URL"] = git_url os.environ["ACTIVE_WORKFLOW_BRANCH"] = branch @@ -320,12 +424,106 @@ async def change_workflow(request: Request): # Trigger a new run to greet user with workflow context # This runs in background via backend POST - import asyncio asyncio.create_task(trigger_workflow_greeting(git_url, branch, path)) return {"message": "Workflow updated", "gitUrl": git_url, "branch": branch, "path": path} +async def clone_repo_at_runtime(git_url: str, branch: str, name: str) -> tuple[bool, str]: + """ + Clone a repository at runtime. + + This mirrors the logic in hydrate.sh but runs when repos are added + after the pod has started. + + Args: + git_url: Git repository URL + branch: Branch to clone + name: Name for the cloned directory (derived from URL if empty) + + Returns: + (success, repo_dir_path) tuple + """ + import tempfile + import shutil + from pathlib import Path + + if not git_url: + return False, "" + + # Derive repo name from URL if not provided + if not name: + name = git_url.split("/")[-1].removesuffix(".git") + + # Repos are stored in /workspace/repos/{name} (matching hydrate.sh) + workspace_path = os.getenv("WORKSPACE_PATH", "/workspace") + repos_dir = Path(workspace_path) / "repos" + repos_dir.mkdir(parents=True, exist_ok=True) + repo_final = repos_dir / name + + logger.info(f"Cloning repo '{name}' from {git_url}@{branch}") + + # Skip if already cloned + if repo_final.exists(): + logger.info(f"Repo '{name}' already exists at {repo_final}, skipping clone") + return True, str(repo_final) + + # Create temp directory for clone + temp_dir = Path(tempfile.mkdtemp(prefix="repo-clone-")) + + try: + # Build git clone command with optional auth token + github_token = os.getenv("GITHUB_TOKEN", "").strip() + gitlab_token = os.getenv("GITLAB_TOKEN", "").strip() + + # Determine which token to use based on URL + clone_url = git_url + if github_token and "github" in git_url.lower(): + # Add GitHub token to URL + clone_url = git_url.replace("https://", f"https://x-access-token:{github_token}@") + logger.info("Using GITHUB_TOKEN for authentication") + elif gitlab_token and "gitlab" in git_url.lower(): + # Add GitLab token to URL + clone_url = git_url.replace("https://", f"https://oauth2:{gitlab_token}@") + logger.info("Using GITLAB_TOKEN for authentication") + + # Clone the repository + process = await asyncio.create_subprocess_exec( + "git", "clone", "--branch", branch, "--single-branch", "--depth", "1", + clone_url, str(temp_dir), + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE + ) + stdout, stderr = await process.communicate() + + if process.returncode != 0: + # Redact tokens from error message + error_msg = stderr.decode() + if github_token: + error_msg = error_msg.replace(github_token, "***REDACTED***") + if gitlab_token: + error_msg = error_msg.replace(gitlab_token, "***REDACTED***") + logger.error(f"Failed to clone repo: {error_msg}") + return False, "" + + logger.info("Clone successful, moving to final location...") + + # Move to final location + repo_final.parent.mkdir(parents=True, exist_ok=True) + shutil.move(str(temp_dir), str(repo_final)) + + logger.info(f"Repo '{name}' ready at {repo_final}") + return True, str(repo_final) + + except Exception as e: + logger.error(f"Error cloning repo: {e}") + return False, "" + finally: + # Cleanup temp directory if it still exists + if temp_dir.exists(): + 
shutil.rmtree(temp_dir, ignore_errors=True) + + async def trigger_workflow_greeting(git_url: str, branch: str, path: str): """Trigger workflow greeting after workflow change.""" import uuid @@ -390,7 +588,7 @@ async def trigger_workflow_greeting(git_url: str, branch: str, path: str): @app.post("/repos/add") async def add_repo(request: Request): """ - Add repository - triggers Claude SDK client restart. + Add repository - clones repo and triggers Claude SDK client restart. Accepts: {"url": "...", "branch": "...", "name": "..."} """ @@ -400,7 +598,23 @@ async def add_repo(request: Request): raise HTTPException(status_code=503, detail="Adapter not initialized") body = await request.json() - logger.info(f"Add repo request: {body}") + url = body.get("url", "") + branch = body.get("branch", "main") + name = body.get("name", "") + + logger.info(f"Add repo request: url={url}, branch={branch}, name={name}") + + if not url: + raise HTTPException(status_code=400, detail="Repository URL is required") + + # Derive name from URL if not provided + if not name: + name = url.split("/")[-1].removesuffix(".git") + + # Clone the repository at runtime + success, repo_path = await clone_repo_at_runtime(url, branch, name) + if not success: + raise HTTPException(status_code=500, detail=f"Failed to clone repository: {url}") # Update REPOS_JSON env var repos_json = os.getenv("REPOS_JSON", "[]") @@ -411,22 +625,81 @@ async def add_repo(request: Request): # Add new repo repos.append({ - "name": body.get("name", ""), + "name": name, "input": { - "url": body.get("url", ""), - "branch": body.get("branch", "main") + "url": url, + "branch": branch } }) os.environ["REPOS_JSON"] = json.dumps(repos) - # Reset adapter state + # Reset adapter state to force reinitialization on next run _adapter_initialized = False adapter._first_run = True - logger.info(f"Repo added, adapter will reinitialize on next run") + logger.info(f"Repo '{name}' added and cloned, adapter will reinitialize on next run") + + # Trigger a notification to Claude about the new repository + asyncio.create_task(trigger_repo_added_notification(name, url)) + + return {"message": "Repository added", "name": name, "path": repo_path} + + +async def trigger_repo_added_notification(repo_name: str, repo_url: str): + """Notify Claude that a repository has been added.""" + import uuid + import aiohttp + + # Wait a moment for repo to be fully ready + await asyncio.sleep(1) + + logger.info(f"Triggering repo added notification for: {repo_name}") + + try: + backend_url = os.getenv("BACKEND_API_URL", "").rstrip("/") + project_name = os.getenv("AGENTIC_SESSION_NAMESPACE", "").strip() + session_id = context.session_id if context else "unknown" + + if not backend_url or not project_name: + logger.error("Cannot trigger repo notification: BACKEND_API_URL or PROJECT_NAME not set") + return + + url = f"{backend_url}/projects/{project_name}/agentic-sessions/{session_id}/agui/run" + + notification = f"The repository '{repo_name}' has been added to your workspace. You can now access it at the path 'repos/{repo_name}/'. Please acknowledge this to the user and let them know you can now read and work with files in this repository." 
+ + payload = { + "threadId": session_id, + "runId": str(uuid.uuid4()), + "messages": [{ + "id": str(uuid.uuid4()), + "role": "user", + "content": notification, + "metadata": { + "hidden": True, + "autoSent": True, + "source": "repo_added" + } + }] + } + + bot_token = os.getenv("BOT_TOKEN", "").strip() + headers = {"Content-Type": "application/json"} + if bot_token: + headers["Authorization"] = f"Bearer {bot_token}" + + async with aiohttp.ClientSession() as session: + async with session.post(url, json=payload, headers=headers) as resp: + if resp.status == 200: + result = await resp.json() + logger.info(f"Repo notification sent: {result}") + else: + error_text = await resp.text() + logger.error(f"Repo notification failed: {resp.status} - {error_text}") - return {"message": "Repository added"} + except Exception as e: + logger.error(f"Failed to trigger repo notification: {e}") @app.post("/repos/remove") diff --git a/components/runners/state-sync/hydrate.sh b/components/runners/state-sync/hydrate.sh index 165f198c4..4c33d2ada 100644 --- a/components/runners/state-sync/hydrate.sh +++ b/components/runners/state-sync/hydrate.sh @@ -14,11 +14,12 @@ NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}" SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}" # Paths to sync (must match sync.sh) +# Note: .claude uses /app/.claude (SubPath mount), others use /workspace SYNC_PATHS=( - ".claude" "artifacts" "file-uploads" ) +CLAUDE_DATA_PATH="/app/.claude" # Error handler error_exit() { @@ -56,14 +57,15 @@ echo "=========================================" # Create workspace structure echo "Creating workspace structure..." -mkdir -p /workspace/.claude || error_exit "Failed to create .claude directory" +# .claude is mounted at /app/.claude via SubPath (same location as runner container) +mkdir -p "${CLAUDE_DATA_PATH}" || error_exit "Failed to create .claude directory" mkdir -p /workspace/artifacts || error_exit "Failed to create artifacts directory" mkdir -p /workspace/file-uploads || error_exit "Failed to create file-uploads directory" mkdir -p /workspace/repos || error_exit "Failed to create repos directory" # Set permissions on created directories (not root workspace which may be owned by different user) # Use 755 instead of 777 - readable by all, writable only by owner -chmod 755 /workspace/.claude /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true +chmod 755 "${CLAUDE_DATA_PATH}" /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true # Check if S3 is configured if [ -z "${S3_ENDPOINT}" ] || [ -z "${S3_BUCKET}" ] || [ -z "${AWS_ACCESS_KEY_ID}" ] || [ -z "${AWS_SECRET_ACCESS_KEY}" ]; then @@ -90,7 +92,19 @@ echo "Checking for existing session state in S3..." if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/" 2>/dev/null | grep -q .; then echo "Found existing session state, downloading from S3..." - # Download each sync path if it exists + # Download .claude data to /app/.claude (SubPath mount matches runner container) + if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/.claude/" 2>/dev/null | grep -q .; then + echo " Downloading .claude/..." 
+ rclone --config /tmp/.config/rclone/rclone.conf copy "${S3_PATH}/.claude/" "${CLAUDE_DATA_PATH}/" \ + --copy-links \ + --transfers 8 \ + --fast-list \ + --progress 2>&1 || echo " Warning: failed to download .claude" + else + echo " No data for .claude/" + fi + + # Download other sync paths to /workspace for path in "${SYNC_PATHS[@]}"; do if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/${path}/" 2>/dev/null | grep -q .; then echo " Downloading ${path}/..." @@ -111,7 +125,7 @@ fi # Set permissions on subdirectories (EmptyDir root may not be chmodable) echo "Setting permissions on subdirectories..." -chmod -R 755 /workspace/.claude /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true +chmod -R 755 "${CLAUDE_DATA_PATH}" /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true # ======================================== # Clone repositories and workflows diff --git a/components/runners/state-sync/sync.sh b/components/runners/state-sync/sync.sh index 05498ac5f..401ef30d1 100644 --- a/components/runners/state-sync/sync.sh +++ b/components/runners/state-sync/sync.sh @@ -16,11 +16,12 @@ NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}" SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}" # Paths to sync (non-git content) +# Note: .claude uses /app/.claude (SubPath mount), others use /workspace SYNC_PATHS=( - ".claude" "artifacts" "file-uploads" ) +CLAUDE_DATA_PATH="/app/.claude" # Patterns to exclude from sync EXCLUDE_PATTERNS=( @@ -57,6 +58,14 @@ EOF # Check total size before sync check_size() { local total=0 + + # Check .claude directory size (at /app/.claude via SubPath) + if [ -d "${CLAUDE_DATA_PATH}" ]; then + size=$(du -sb "${CLAUDE_DATA_PATH}" 2>/dev/null | cut -f1 || echo 0) + total=$((total + size)) + fi + + # Check other paths in /workspace for path in "${SYNC_PATHS[@]}"; do if [ -d "/workspace/${path}" ]; then size=$(du -sb "/workspace/${path}" 2>/dev/null | cut -f1 || echo 0) @@ -79,6 +88,26 @@ sync_to_s3() { echo "[$(date -Iseconds)] Starting sync to S3..." local synced=0 + + # Sync .claude data from /app/.claude (SubPath mount matches runner container) + if [ -d "${CLAUDE_DATA_PATH}" ]; then + echo " Syncing .claude/..." + if rclone --config /tmp/.config/rclone/rclone.conf sync "${CLAUDE_DATA_PATH}" "${s3_path}/.claude/" \ + --checksum \ + --copy-links \ + --transfers 4 \ + --fast-list \ + --stats-one-line \ + --max-size ${MAX_SYNC_SIZE} \ + $(printf -- '--exclude %s ' "${EXCLUDE_PATTERNS[@]}") \ + 2>&1; then + synced=$((synced + 1)) + else + echo " Warning: sync of .claude had errors" + fi + fi + + # Sync other paths from /workspace for path in "${SYNC_PATHS[@]}"; do if [ -d "/workspace/${path}" ]; then echo " Syncing ${path}/..." From a27a37f3f275217af208c5bc3b57f3d123afdf10 Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 21:26:25 -0600 Subject: [PATCH 3/6] refactor: Clean up session handling and remove deprecated workspace access endpoints - Removed deprecated workspace access endpoints from session routes, streamlining API. - Enhanced session metadata extraction for improved error handling in GetSession. - Updated comments and TODOs in reconciler and session handler files to reflect ongoing migration to controller-runtime patterns. 
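
With the workspace/enable and workspace/touch endpoints removed, session artifacts are read directly from the configured S3 bucket. A minimal sketch of pulling them with rclone, assuming the <bucket>/<namespace>/<session-name> prefix layout that hydrate.sh and sync.sh use (the remote name "s3" is illustrative, not part of this change):

    # list and download the artifacts synced for one session
    rclone lsf s3:<bucket>/<namespace>/<session-name>/artifacts/
    rclone copy s3:<bucket>/<namespace>/<session-name>/artifacts/ ./artifacts/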
--- components/backend/handlers/sessions.go | 31 ++++--------- components/backend/routes.go | 2 - .../operator/internal/handlers/reconciler.go | 9 +++- .../operator/internal/handlers/sessions.go | 44 +++++-------------- 4 files changed, 26 insertions(+), 60 deletions(-) diff --git a/components/backend/handlers/sessions.go b/components/backend/handlers/sessions.go index 6af2681a8..276d0b019 100644 --- a/components/backend/handlers/sessions.go +++ b/components/backend/handlers/sessions.go @@ -748,10 +748,18 @@ func GetSession(c *gin.Context) { return } + // Safely extract metadata using type-safe pattern + metadata, ok := item.Object["metadata"].(map[string]interface{}) + if !ok { + log.Printf("GetSession: invalid metadata for session %s", sessionName) + c.JSON(http.StatusInternalServerError, gin.H{"error": "Invalid session metadata"}) + return + } + session := types.AgenticSession{ APIVersion: item.GetAPIVersion(), Kind: item.GetKind(), - Metadata: item.Object["metadata"].(map[string]interface{}), + Metadata: metadata, } if spec, ok := item.Object["spec"].(map[string]interface{}); ok { @@ -2102,27 +2110,6 @@ func StopSession(c *gin.Context) { c.JSON(http.StatusAccepted, session) } -// EnableWorkspaceAccess is deprecated - temporary content pods have been removed -// POST /api/projects/:projectName/agentic-sessions/:sessionName/workspace/enable -func EnableWorkspaceAccess(c *gin.Context) { - c.JSON(http.StatusGone, gin.H{ - "error": "Temporary workspace access has been removed", - "message": "Session artifacts are now stored in S3. Access artifacts directly from your S3 bucket.", - "hint": "Configure S3 storage in project settings to persist session state and artifacts.", - "s3Path": fmt.Sprintf("s3://{bucket}/{namespace}/%s/", c.Param("sessionName")), - }) -} - -// TouchWorkspaceAccess updates the last-accessed timestamp to keep temp pod alive -// POST /api/projects/:projectName/agentic-sessions/:sessionName/workspace/touch -func TouchWorkspaceAccess(c *gin.Context) { - // Deprecated: Temp-content pods no longer exist - c.JSON(http.StatusGone, gin.H{ - "error": "Temporary workspace access has been removed", - "message": "Session artifacts are stored in S3 and do not require touch/keepalive.", - }) -} - // GetSessionK8sResources returns job, pod, and PVC information for a session // GET /api/projects/:projectName/agentic-sessions/:sessionName/k8s-resources func GetSessionK8sResources(c *gin.Context) { diff --git a/components/backend/routes.go b/components/backend/routes.go index 7e8c95df4..539ca4ea5 100644 --- a/components/backend/routes.go +++ b/components/backend/routes.go @@ -56,8 +56,6 @@ func registerRoutes(r *gin.Engine) { projectGroup.POST("/agentic-sessions/:sessionName/clone", handlers.CloneSession) projectGroup.POST("/agentic-sessions/:sessionName/start", handlers.StartSession) projectGroup.POST("/agentic-sessions/:sessionName/stop", handlers.StopSession) - projectGroup.POST("/agentic-sessions/:sessionName/workspace/enable", handlers.EnableWorkspaceAccess) - projectGroup.POST("/agentic-sessions/:sessionName/workspace/touch", handlers.TouchWorkspaceAccess) projectGroup.GET("/agentic-sessions/:sessionName/workspace", handlers.ListSessionWorkspace) projectGroup.GET("/agentic-sessions/:sessionName/workspace/*path", handlers.GetSessionWorkspaceFile) projectGroup.PUT("/agentic-sessions/:sessionName/workspace/*path", handlers.PutSessionWorkspaceFile) diff --git a/components/operator/internal/handlers/reconciler.go b/components/operator/internal/handlers/reconciler.go index 
f7982932e..a6e079fa9 100644 --- a/components/operator/internal/handlers/reconciler.go +++ b/components/operator/internal/handlers/reconciler.go @@ -10,15 +10,20 @@ import ( "time" corev1 "k8s.io/api/core/v1" - v1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" "ambient-code-operator/internal/config" - "ambient-code-operator/internal/types" ) // ReconcilePendingSession handles the Pending phase - creates pod and services. // This is the main entry point called from the controller for pending sessions. +// +// TODO(controller-runtime-migration): This is a transitional wrapper around the legacy +// handleAgenticSessionEvent() function (2,300+ lines). Future work should: +// 1. Extract phase-specific logic into separate functions (ReconcilePending, ReconcileRunning, etc.) +// 2. Use controller-runtime patterns (Patch, StatusWriter, etc.) instead of direct API calls +// 3. Remove handleAgenticSessionEvent() entirely +// This approach allows adopting controller-runtime framework without rewriting all logic at once. func ReconcilePendingSession(ctx context.Context, session *unstructured.Unstructured, appConfig *config.Config) error { // Delegate to existing handleAgenticSessionEvent logic // This is a wrapper that allows the existing code to be called from the controller diff --git a/components/operator/internal/handlers/sessions.go b/components/operator/internal/handlers/sessions.go index 867a09e68..2fab8f408 100644 --- a/components/operator/internal/handlers/sessions.go +++ b/components/operator/internal/handlers/sessions.go @@ -25,20 +25,25 @@ import ( v1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" intstr "k8s.io/apimachinery/pkg/util/intstr" - "k8s.io/apimachinery/pkg/watch" "k8s.io/client-go/util/retry" ) // Track which pods are currently being monitored to prevent duplicate goroutines -// NOTE: This is used by the legacy handleAgenticSessionEvent function which is -// kept for reference but no longer actively called by the operator. -// The controller-runtime based reconciler in internal/controller/ handles all -// AgenticSession reconciliation now. var ( monitoredPods = make(map[string]bool) monitoredPodsMu sync.Mutex ) +// handleAgenticSessionEvent is the legacy reconciliation function containing all session +// lifecycle logic (~2,300 lines). It's called by ReconcilePendingSession() wrapper. +// +// TODO(controller-runtime-migration): This function should be refactored into smaller, +// phase-specific reconcilers that use controller-runtime patterns. Current architecture: +// - ✅ Controller-runtime framework adopted (work queue, leader election, metrics) +// - ⚠️ Business logic still uses legacy patterns (direct API calls, manual status updates) +// - 🔜 Future: Break into ReconcilePending, ReconcileRunning, ReconcileStopped functions +// +// This transitional approach allows framework adoption without rewriting 2,300 lines at once. 
func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { name := obj.GetName() sessionNamespace := obj.GetNamespace() @@ -681,35 +686,6 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }) } - // Extract repos configuration (simplified format: url and branch) - type RepoConfig struct { - URL string - Branch string - } - - var repos []RepoConfig - - // Read repos[] array format - if reposArr, found, _ := unstructured.NestedSlice(spec, "repos"); found && len(reposArr) > 0 { - repos = make([]RepoConfig, 0, len(reposArr)) - for _, repoItem := range reposArr { - if repoMap, ok := repoItem.(map[string]interface{}); ok { - repo := RepoConfig{} - if url, ok := repoMap["url"].(string); ok { - repo.URL = url - } - if branch, ok := repoMap["branch"].(string); ok { - repo.Branch = branch - } else { - repo.Branch = "main" - } - if repo.URL != "" { - repos = append(repos, repo) - } - } - } - } - // Read autoPushOnComplete flag autoPushOnComplete, _, _ := unstructured.NestedBool(spec, "autoPushOnComplete") From 512b7ec9b19cec296b676029bb01441b261e2872 Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 22:11:28 -0600 Subject: [PATCH 4/6] refactor: Clean up code formatting and improve readability - Removed unnecessary blank lines in agenticsession_controller.go and reconcile_phases.go for better code clarity. - Standardized the formatting of metric variable declarations in otel_metrics.go to enhance consistency across the file. --- .../internal/controller/agenticsession_controller.go | 1 - .../operator/internal/controller/otel_metrics.go | 10 +++++----- .../operator/internal/controller/reconcile_phases.go | 12 +++++------- 3 files changed, 10 insertions(+), 13 deletions(-) diff --git a/components/operator/internal/controller/agenticsession_controller.go b/components/operator/internal/controller/agenticsession_controller.go index f85e2167a..a33c9778f 100644 --- a/components/operator/internal/controller/agenticsession_controller.go +++ b/components/operator/internal/controller/agenticsession_controller.go @@ -281,7 +281,6 @@ func (r *AgenticSessionReconciler) SetupWithManager(mgr ctrl.Manager) error { return nil } - // GetGVR returns the GroupVersionResource for AgenticSession func GetGVR() schema.GroupVersionResource { return optypes.GetAgenticSessionResource() diff --git a/components/operator/internal/controller/otel_metrics.go b/components/operator/internal/controller/otel_metrics.go index d6c4fba0f..a6101118e 100644 --- a/components/operator/internal/controller/otel_metrics.go +++ b/components/operator/internal/controller/otel_metrics.go @@ -38,11 +38,11 @@ var ( sessionsByProject metric.Int64Counter // Error metrics (counters) - reconcileRetries metric.Int64Counter - sessionTimeouts metric.Int64Counter - s3Errors metric.Int64Counter - tokenRefreshErrors metric.Int64Counter - podRestarts metric.Int64Counter + reconcileRetries metric.Int64Counter + sessionTimeouts metric.Int64Counter + s3Errors metric.Int64Counter + tokenRefreshErrors metric.Int64Counter + podRestarts metric.Int64Counter ) // InitMetrics initializes OpenTelemetry metrics diff --git a/components/operator/internal/controller/reconcile_phases.go b/components/operator/internal/controller/reconcile_phases.go index 3dbe4db62..082b53772 100644 --- a/components/operator/internal/controller/reconcile_phases.go +++ b/components/operator/internal/controller/reconcile_phases.go @@ -91,11 +91,11 @@ func recordImagePullDuration(namespace string, pod *corev1.Pod) { // Check all containers for image 
pull timing for _, cs := range pod.Status.ContainerStatuses { - if cs.State.Running != nil && cs.State.Running.StartedAt.Time.After(podCreated) { + if cs.State.Running != nil && cs.State.Running.StartedAt.After(podCreated) { // Approximate image pull duration as time from pod creation to container start // This includes scheduling + image pull + container creation - duration := cs.State.Running.StartedAt.Time.Sub(podCreated).Seconds() - + duration := cs.State.Running.StartedAt.Sub(podCreated).Seconds() + // Extract image name (remove tag/digest for cleaner metrics) image := cs.Image if idx := strings.Index(image, "@"); idx != -1 { @@ -103,9 +103,9 @@ func recordImagePullDuration(namespace string, pod *corev1.Pod) { } else if idx := strings.LastIndex(image, ":"); idx != -1 { image = image[:idx] } - + RecordImagePullDuration(namespace, image, duration) - + // Log for first container only (usually the runner) log.Log.Info("Image pull completed", "namespace", namespace, @@ -147,7 +147,6 @@ func recordStartupTime(namespace, sessionName string, session *unstructured.Unst ) } - // reconcilePending handles sessions in Pending phase. // This creates the runner pod and transitions to Creating phase. func (r *AgenticSessionReconciler) reconcilePending(ctx context.Context, session *unstructured.Unstructured) (ctrl.Result, error) { @@ -379,4 +378,3 @@ func (r *AgenticSessionReconciler) reconcileStopping(ctx context.Context, sessio // Requeue to check again return ctrl.Result{RequeueAfter: 2 * time.Second}, nil } - From 54b33821f3d9046c00af0aa4346844cee7f3579c Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 23:46:14 -0600 Subject: [PATCH 5/6] refactor: Improve session detail and message handling - Updated repository path handling in ProjectSessionDetailPage to ensure consistency in workspace structure. - Enhanced conditional display logic for the welcome experience based on session status, improving user interaction. - Refined chat interface visibility logic in MessagesTab to only show when the session is in the Running state, clarifying user expectations. - Adjusted dropdown menu visibility to only appear when there are stream messages, streamlining the UI. 
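
The repos/ prefix added to the repo path options mirrors the on-disk layout created by hydrate.sh and clone_repo_at_runtime(). A quick check from inside the runner pod (paths assumed from hydrate.sh, shown only to illustrate the layout the UI now points at):

    ls /workspace/repos/        # one directory per cloned repository
    ls /workspace/artifacts/    # artifacts directory that sync.sh pushes to S3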
--- .../[name]/sessions/[sessionName]/page.tsx | 5 ++- .../src/components/session/MessagesTab.tsx | 39 ++++++++++--------- 2 files changed, 24 insertions(+), 20 deletions(-) diff --git a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx index 987230788..2617325b9 100644 --- a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx +++ b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx @@ -624,10 +624,11 @@ export default function ProjectSessionDetailPage({ if (session?.spec?.repos) { session.spec.repos.forEach((repo, idx) => { const repoName = repo.url.split('/').pop()?.replace('.git', '') || `repo-${idx}`; + // Repos are cloned to /workspace/repos/{name} options.push({ type: "repo", name: repoName, - path: repoName, + path: `repos/${repoName}`, }); }); } @@ -1905,7 +1906,7 @@ export default function ProjectSessionDetailPage({ workflowMetadata={workflowMetadata} onCommandClick={handleCommandClick} isRunActive={isRunActive} - showWelcomeExperience={true} + showWelcomeExperience={!["Completed", "Failed", "Stopped", "Stopping"].includes(session?.status?.phase || "")} activeWorkflow={workflowManagement.activeWorkflow} userHasInteracted={userHasInteracted} queuedMessages={sessionQueue.messages} diff --git a/components/frontend/src/components/session/MessagesTab.tsx b/components/frontend/src/components/session/MessagesTab.tsx index 3e45d60b6..e37891e8a 100644 --- a/components/frontend/src/components/session/MessagesTab.tsx +++ b/components/frontend/src/components/session/MessagesTab.tsx @@ -63,8 +63,9 @@ const MessagesTab: React.FC = ({ session, streamMessages, chat const phase = session?.status?.phase || ""; const isInteractive = session?.spec?.interactive; - // Show chat interface when session is interactive AND (in Running state OR showing welcome experience) - const showChatInterface = isInteractive && (phase === "Running" || showWelcomeExperience); + // Show chat interface only when session is interactive AND Running + // Welcome experience can be shown during Pending/Creating, but chat input only when Running + const showChatInterface = isInteractive && phase === "Running"; // Determine if session is in a terminal state const isTerminalState = ["Completed", "Failed", "Stopped"].includes(phase); @@ -713,26 +714,28 @@ const MessagesTab: React.FC = ({ session, streamMessages, chat
)} - {isInteractive && !showChatInterface && streamMessages.length > 0 && ( + {isInteractive && !showChatInterface && (streamMessages.length > 0 || isCreating || isTerminalState) && (
- - - - - - - Show system messages - - - + {streamMessages.length > 0 && ( + + + + + + + Show system messages + + + + )}

{isCreating && "Chat will be available once the session is running..."} {isTerminalState && ( From 9f34b80a6b1a91a2a4c38fefd44abfedf175c001 Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 23:54:11 -0600 Subject: [PATCH 6/6] feat: Add state-sync component and observability stack deployment - Introduced a new state-sync component in the build and deploy workflows, enhancing the deployment process. - Added steps to deploy the observability stack in both components-build-deploy and prod-release-deploy workflows. - Updated kustomization to include the state-sync image for consistent image tagging across environments. - Enhanced environment variable settings to include the state-sync image in deployment configurations. --- .github/workflows/components-build-deploy.yml | 17 +++++++++++++++-- .github/workflows/prod-release-deploy.yaml | 12 +++++++++++- 2 files changed, 26 insertions(+), 3 deletions(-) diff --git a/.github/workflows/components-build-deploy.yml b/.github/workflows/components-build-deploy.yml index b3ea8bf6f..5bf6b5a88 100644 --- a/.github/workflows/components-build-deploy.yml +++ b/.github/workflows/components-build-deploy.yml @@ -84,6 +84,11 @@ jobs: image: quay.io/ambient_code/vteam_claude_runner dockerfile: ./components/runners/claude-code-runner/Dockerfile changed: ${{ needs.detect-changes.outputs.claude-runner }} + - name: state-sync + context: ./components/runners + image: quay.io/ambient_code/vteam_state_sync + dockerfile: ./components/runners/state-sync/Dockerfile + changed: ${{ needs.detect-changes.outputs.claude-runner }} steps: - name: Checkout code if: matrix.component.changed == 'true' || github.event_name == 'workflow_dispatch' @@ -163,6 +168,10 @@ jobs: oc apply -k components/manifests/base/rbac/ oc apply -f components/manifests/overlays/production/operator-config-openshift.yaml -n ambient-code + - name: Deploy observability stack + run: | + oc apply -k components/manifests/observability/ + deploy-to-openshift: runs-on: ubuntu-latest needs: [detect-changes, build-and-push, update-rbac-and-crd] @@ -220,6 +229,7 @@ jobs: kustomize edit set image quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:${{ steps.image-tags.outputs.backend_tag }} kustomize edit set image quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:${{ steps.image-tags.outputs.operator_tag }} kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:${{ steps.image-tags.outputs.runner_tag }} + kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:${{ steps.image-tags.outputs.runner_tag }} - name: Validate kustomization working-directory: components/manifests/overlays/production @@ -250,7 +260,8 @@ jobs: run: | oc set env deployment/agentic-operator -n ambient-code -c agentic-operator \ AMBIENT_CODE_RUNNER_IMAGE="quay.io/ambient_code/vteam_claude_runner:${{ steps.image-tags.outputs.runner_tag }}" \ - CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:${{ steps.image-tags.outputs.backend_tag }}" + CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:${{ steps.image-tags.outputs.backend_tag }}" \ + STATE_SYNC_IMAGE="quay.io/ambient_code/vteam_state_sync:${{ steps.image-tags.outputs.runner_tag }}" deploy-with-disptach: runs-on: ubuntu-latest @@ -282,6 +293,7 @@ jobs: kustomize edit set image quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:stage kustomize edit set image 
quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:stage kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:stage + kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:stage - name: Validate kustomization working-directory: components/manifests/overlays/production @@ -309,4 +321,5 @@ jobs: run: | oc set env deployment/agentic-operator -n ambient-code -c agentic-operator \ AMBIENT_CODE_RUNNER_IMAGE="quay.io/ambient_code/vteam_claude_runner:stage" \ - CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:stage" + CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:stage" \ + STATE_SYNC_IMAGE="quay.io/ambient_code/vteam_state_sync:stage" diff --git a/.github/workflows/prod-release-deploy.yaml b/.github/workflows/prod-release-deploy.yaml index fc4f198f4..27b644355 100644 --- a/.github/workflows/prod-release-deploy.yaml +++ b/.github/workflows/prod-release-deploy.yaml @@ -158,6 +158,10 @@ jobs: context: ./components/runners image: quay.io/ambient_code/vteam_claude_runner dockerfile: ./components/runners/claude-code-runner/Dockerfile + - name: state-sync + context: ./components/runners + image: quay.io/ambient_code/vteam_state_sync + dockerfile: ./components/runners/state-sync/Dockerfile steps: - name: Checkout code from the tag generated above uses: actions/checkout@v5 @@ -221,6 +225,10 @@ jobs: run: | oc login ${{ secrets.PROD_OPENSHIFT_SERVER }} --token=${{ secrets.PROD_OPENSHIFT_TOKEN }} --insecure-skip-tls-verify + - name: Deploy observability stack + run: | + oc apply -k components/manifests/observability/ + - name: Update kustomization with release image tags working-directory: components/manifests/overlays/production run: | @@ -229,6 +237,7 @@ jobs: kustomize edit set image quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:${RELEASE_TAG} kustomize edit set image quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:${RELEASE_TAG} kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:${RELEASE_TAG} + kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:${RELEASE_TAG} - name: Validate kustomization working-directory: components/manifests/overlays/production @@ -256,4 +265,5 @@ jobs: run: | oc set env deployment/agentic-operator -n ambient-code -c agentic-operator \ AMBIENT_CODE_RUNNER_IMAGE="quay.io/ambient_code/vteam_claude_runner:${{ needs.release.outputs.new_tag }}" \ - CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:${{ needs.release.outputs.new_tag }}" + CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:${{ needs.release.outputs.new_tag }}" \ + STATE_SYNC_IMAGE="quay.io/ambient_code/vteam_state_sync:${{ needs.release.outputs.new_tag }}"
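
Once the deploy job has run, the operator deployment should carry all three runtime image variables. A post-deploy sanity check, mirroring the oc set env calls above (the grep filter is just an illustration):

    oc set env deployment/agentic-operator -n ambient-code -c agentic-operator --list \
      | grep -E 'AMBIENT_CODE_RUNNER_IMAGE|CONTENT_SERVICE_IMAGE|STATE_SYNC_IMAGE'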