From 193b0340c6c257f95d40a96ea08515d24078b6dd Mon Sep 17 00:00:00 2001
From: Gage Krumbach
Date: Sat, 20 Dec 2025 07:36:43 -0600
Subject: [PATCH 1/6] feat: Enhance session handling and observability

- Refactored session management to improve clarity and efficiency, including the removal of self-referential parent-session-id annotations.
- Updated session workspace path handling to be relative to the content service's StateBaseDir, simplifying path management.
- Introduced graceful shutdown for the content service, improving reliability during server termination (a sketch of the pattern follows below).
- Enhanced the observability stack with new Grafana dashboard configurations and metrics for session lifecycle tracking.
- Cleaned up unused code and improved logging for easier debugging and maintenance.

chore: Update .gitignore and remove obsolete deployment documentation

- Added build-log and log-file patterns to .gitignore to prevent accidental commits.
- Deleted outdated deployment documentation files (DEPLOYMENT_CHANGES.md, DIFF_IMPROVEMENTS.md, S3_MIGRATION_GAPS.md, and OPENSHIFT_SETUP.md) that are no longer relevant to the current architecture.
- Cleaned up observability-related files, including Grafana and Prometheus configurations, to streamline the observability stack.

feat: Enhance operator metrics and session handling

- Introduced Prometheus metrics for monitoring the session lifecycle, including startup duration, phase transitions, and error tracking.
- Updated session handling to record metrics during reconciliation, including session creation and completion.
- Refactored session management logic to ensure consistent behavior across API and kubectl session creations.
- Increased QPS and Burst settings for the Kubernetes client to improve performance under load.
- Added a new Service and ServiceMonitor for exposing operator metrics in the ambient-code namespace.

feat: Refactor AgenticSession handling to use Pods instead of Jobs

- Updated the operator to create and manage Pods directly for AgenticSessions, improving startup speed and reducing complexity.
- Changed environment variable references and logging to reflect the transition from Jobs to Pods.
- Adjusted cleanup logic to handle Pods appropriately, including service creation and monitoring.
- Modified deployment configurations to ensure compatibility with the new Pod-based architecture.

feat: Implement S3 storage configuration for session artifacts

- Added support for S3-compatible storage in the settings section, allowing users to configure the S3 endpoint, bucket, region, access key, and secret key.
- Updated the operator to persist session state and artifacts in S3, replacing the previous temporary content pod mechanism.
- Removed deprecated references to temporary content pods and PVCs, transitioning to an EmptyDir storage model with S3 integration.
- Enhanced the operator's handling of S3 configuration, ensuring proper validation and logging for S3 settings.
- Updated the Makefile to include new build targets for the state-sync image and MinIO setup.

feat: Enhance operator deployment with controller-runtime features

- Added command-line arguments for metrics and health probe endpoints, enabling better observability.
- Implemented concurrent reconciliation with a configurable maximum, improving performance.
- Updated the Dockerfile to use ENTRYPOINT for better argument handling.
- Enhanced health checks with HTTP probes for liveness and readiness.
- Updated the README to reflect new configuration options and features.
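The content-service shutdown mentioned above follows the standard net/http graceful-shutdown pattern instead of gin's blocking Run(). A minimal, self-contained sketch of the pattern is below; the port, handler, and log messages are placeholders, and the actual wiring is in components/backend/server/server.go in this patch:

    package main

    import (
        "context"
        "log"
        "net/http"
        "os"
        "os/signal"
        "syscall"
        "time"
    )

    func main() {
        srv := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}

        // Catch SIGINT/SIGTERM (the kubelet sends SIGTERM during pod termination).
        quit := make(chan os.Signal, 1)
        signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)

        go func() {
            // Serve until Shutdown is called; ErrServerClosed is expected at that point.
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatalf("listen error: %v", err)
            }
        }()

        sig := <-quit
        log.Printf("received %v, shutting down gracefully", sig)

        // Give in-flight requests up to 10 seconds to finish, matching the patch.
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("forced shutdown: %v", err)
        }
        log.Println("shutdown complete")
    }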
feat: Enhance observability stack deployment and cleanup in Makefile - Added new targets for deploying and cleaning up the observability stack, including OpenTelemetry and Grafana. - Introduced commands for accessing Grafana and Prometheus dashboards. - Updated .gitignore to include secrets template for MinIO credentials. - Removed deprecated image-prepuller DaemonSet and associated metrics service from manifests. - Updated Makefile to reflect changes in observability management and improve user experience. refactor: Clean up observability stack and enhance session handling - Removed obsolete observability stack deployment commands from Makefile. - Updated session handling in the operator to improve clarity and efficiency. - Introduced a new state sync image in deployment scripts and updated related configurations. - Refactored metrics handling for session lifecycle, ensuring consistent error tracking and performance monitoring. - Cleaned up unused code and improved readability across multiple files. feat: Refactor S3 storage configuration in settings and operator - Replaced S3_ENABLED with STORAGE_MODE to allow selection between shared and custom storage options. - Updated settings section to include radio buttons for storage mode selection, enhancing user experience. - Modified operator session handling to read and apply storage mode, ensuring proper configuration for S3 settings. - Improved logging for storage mode usage, clarifying the configuration process for users. --- .gitignore | 8 + Makefile | 61 +- components/backend/handlers/sessions.go | 645 ++----- components/backend/server/server.go | 42 +- .../[name]/sessions/[sessionName]/page.tsx | 11 - .../sessions/[sessionName]/session-header.tsx | 15 - .../src/components/session-details-modal.tsx | 116 +- .../workspace-sections/settings-section.tsx | 161 +- .../frontend/src/types/project-settings.ts | 8 + components/manifests/base/kustomization.yaml | 3 + .../minio-credentials-secret.yaml.example | 31 + .../manifests/base/minio-deployment.yaml | 102 + .../manifests/base/operator-deployment.yaml | 46 +- .../base/rbac/operator-clusterrole.yaml | 4 +- components/manifests/deploy.sh | 4 + components/manifests/observability/README.md | 191 ++ .../ambient-operator-dashboard.json | 366 ++++ .../manifests/observability/grafana.yaml | 490 +++++ .../observability/kustomization.yaml | 15 + .../observability/otel-collector.yaml | 108 ++ .../grafana-datasource-patch.yaml | 45 + .../overlays/with-grafana/kustomization.yaml | 15 + .../observability/servicemonitor.yaml | 22 + .../overlays/production/kustomization.yaml | 3 + components/operator/Dockerfile | 6 +- components/operator/README.md | 155 +- components/operator/go.mod | 34 +- components/operator/go.sum | 83 +- components/operator/internal/config/config.go | 28 + .../controller/agenticsession_controller.go | 301 +++ .../internal/controller/otel_metrics.go | 467 +++++ .../internal/controller/reconcile_phases.go | 382 ++++ .../operator/internal/handlers/helpers.go | 7 +- .../operator/internal/handlers/namespaces.go | 7 +- .../operator/internal/handlers/reconciler.go | 450 +++++ .../operator/internal/handlers/sessions.go | 1716 +++++++---------- .../internal/services/infrastructure.go | 34 +- components/operator/main.go | 191 +- .../runners/claude-code-runner/adapter.py | 284 +-- components/runners/claude-code-runner/main.py | 33 +- components/runners/state-sync/Dockerfile | 21 + components/runners/state-sync/hydrate.sh | 232 +++ components/runners/state-sync/sync.sh | 156 ++ 
docs/minio-quickstart.md | 297 +++ docs/operator-metrics-visualization.md | 134 ++ docs/s3-storage-configuration.md | 393 ++++ scripts/setup-minio.sh | 85 + 47 files changed, 6042 insertions(+), 1966 deletions(-) create mode 100644 components/manifests/base/minio-credentials-secret.yaml.example create mode 100644 components/manifests/base/minio-deployment.yaml create mode 100644 components/manifests/observability/README.md create mode 100644 components/manifests/observability/dashboards/ambient-operator-dashboard.json create mode 100644 components/manifests/observability/grafana.yaml create mode 100644 components/manifests/observability/kustomization.yaml create mode 100644 components/manifests/observability/otel-collector.yaml create mode 100644 components/manifests/observability/overlays/with-grafana/grafana-datasource-patch.yaml create mode 100644 components/manifests/observability/overlays/with-grafana/kustomization.yaml create mode 100644 components/manifests/observability/servicemonitor.yaml create mode 100644 components/operator/internal/controller/agenticsession_controller.go create mode 100644 components/operator/internal/controller/otel_metrics.go create mode 100644 components/operator/internal/controller/reconcile_phases.go create mode 100644 components/operator/internal/handlers/reconciler.go create mode 100644 components/runners/state-sync/Dockerfile create mode 100644 components/runners/state-sync/hydrate.sh create mode 100644 components/runners/state-sync/sync.sh create mode 100644 docs/minio-quickstart.md create mode 100644 docs/operator-metrics-visualization.md create mode 100644 docs/s3-storage-configuration.md create mode 100755 scripts/setup-minio.sh diff --git a/.gitignore b/.gitignore index 4925129cb..b84450271 100644 --- a/.gitignore +++ b/.gitignore @@ -140,3 +140,11 @@ reports/ # Security scan artifacts (transient) .security-scan/ .security-scan.zip + +# Secrets (should use .example templates) +**/minio-credentials-secret.yaml + +# Build artifacts and logs +build.log +*.log +!components/**/*.log diff --git a/Makefile b/Makefile index 13fa26ca6..987164c05 100644 --- a/Makefile +++ b/Makefile @@ -1,10 +1,11 @@ -.PHONY: help setup build-all build-frontend build-backend build-operator build-runner deploy clean +.PHONY: help setup build-all build-frontend build-backend build-operator build-runner build-state-sync deploy clean .PHONY: local-up local-down local-clean local-status local-rebuild local-reload-backend local-reload-frontend local-reload-operator local-sync-version .PHONY: local-dev-token .PHONY: local-logs local-logs-backend local-logs-frontend local-logs-operator local-shell local-shell-frontend .PHONY: local-test local-test-dev local-test-quick test-all local-url local-troubleshoot local-port-forward local-stop-port-forward .PHONY: push-all registry-login setup-hooks remove-hooks check-minikube check-kubectl .PHONY: e2e-test e2e-setup e2e-clean deploy-langfuse-openshift +.PHONY: setup-minio minio-console minio-logs minio-status .PHONY: validate-makefile lint-makefile check-shell makefile-health .PHONY: _create-operator-config _auto-port-forward _show-access-info _build-and-load @@ -36,6 +37,7 @@ FRONTEND_IMAGE ?= vteam_frontend:latest BACKEND_IMAGE ?= vteam_backend:latest OPERATOR_IMAGE ?= vteam_operator:latest RUNNER_IMAGE ?= vteam_claude_runner:latest +STATE_SYNC_IMAGE ?= vteam_state_sync:latest # Build metadata (captured at build time) GIT_COMMIT := $(shell git rev-parse HEAD 2>/dev/null || echo "unknown") @@ -91,7 +93,7 @@ help: ## Display this help 
message ##@ Building -build-all: build-frontend build-backend build-operator build-runner ## Build all container images +build-all: build-frontend build-backend build-operator build-runner build-state-sync ## Build all container images build-frontend: ## Build frontend image @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Building frontend with $(CONTAINER_ENGINE)..." @@ -145,6 +147,13 @@ build-runner: ## Build Claude Code runner image -t $(RUNNER_IMAGE) -f claude-code-runner/Dockerfile . @echo "$(COLOR_GREEN)✓$(COLOR_RESET) Runner built: $(RUNNER_IMAGE)" +build-state-sync: ## Build state-sync image for S3 persistence + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Building state-sync with $(CONTAINER_ENGINE)..." + @echo " Git: $(GIT_BRANCH)@$(GIT_COMMIT_SHORT)$(GIT_DIRTY)" + @cd components/runners/state-sync && $(CONTAINER_ENGINE) build $(PLATFORM_FLAG) $(BUILD_FLAGS) \ + -t vteam_state_sync:latest . + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) State-sync built: vteam_state_sync:latest" + ##@ Git Hooks setup-hooks: ## Install git hooks for branch protection @@ -164,13 +173,59 @@ registry-login: ## Login to container registry push-all: registry-login ## Push all images to registry @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Pushing images to $(REGISTRY)..." - @for image in $(FRONTEND_IMAGE) $(BACKEND_IMAGE) $(OPERATOR_IMAGE) $(RUNNER_IMAGE); do \ + @for image in $(FRONTEND_IMAGE) $(BACKEND_IMAGE) $(OPERATOR_IMAGE) $(RUNNER_IMAGE) $(STATE_SYNC_IMAGE); do \ echo " Tagging and pushing $$image..."; \ $(CONTAINER_ENGINE) tag $$image $(REGISTRY)/$$image && \ $(CONTAINER_ENGINE) push $(REGISTRY)/$$image; \ done @echo "$(COLOR_GREEN)✓$(COLOR_RESET) All images pushed" +##@ MinIO S3 Storage + +setup-minio: ## Set up MinIO and create initial bucket + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Setting up MinIO for S3 state storage..." + @./scripts/setup-minio.sh + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) MinIO setup complete" + +minio-console: ## Open MinIO console (port-forward to localhost:9001) + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Opening MinIO console at http://localhost:9001" + @echo " Login: admin / changeme123 (or your configured credentials)" + @kubectl port-forward svc/minio 9001:9001 -n $(NAMESPACE) + +minio-logs: ## View MinIO logs + @kubectl logs -f deployment/minio -n $(NAMESPACE) + +minio-status: ## Check MinIO status + @echo "$(COLOR_BOLD)MinIO Status$(COLOR_RESET)" + @kubectl get deployment,pod,svc,pvc -l app=minio -n $(NAMESPACE) + +##@ Observability + +deploy-observability: ## Deploy observability (OTel + OpenShift Prometheus) + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Deploying observability stack..." + @kubectl apply -k components/manifests/observability/ + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) Observability deployed (OTel + ServiceMonitor)" + @echo " View metrics: OpenShift Console → Observe → Metrics" + @echo " Optional Grafana: make add-grafana" + +add-grafana: ## Add Grafana on top of observability stack + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Adding Grafana..." + @kubectl apply -k components/manifests/observability/overlays/with-grafana/ + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) Grafana deployed" + @echo " Create route: oc create route edge grafana --service=grafana -n $(NAMESPACE)" + +clean-observability: ## Remove observability components + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Removing observability..." 
+ @kubectl delete -k components/manifests/observability/overlays/with-grafana/ 2>/dev/null || true + @kubectl delete -k components/manifests/observability/ 2>/dev/null || true + @echo "$(COLOR_GREEN)✓$(COLOR_RESET) Observability removed" + +grafana-dashboard: ## Open Grafana (create route first) + @echo "$(COLOR_BLUE)▶$(COLOR_RESET) Opening Grafana..." + @oc create route edge grafana --service=grafana -n $(NAMESPACE) 2>/dev/null || echo "Route already exists" + @echo " URL: https://$$(oc get route grafana -n $(NAMESPACE) -o jsonpath='{.spec.host}')" + @echo " Login: admin/admin" + ##@ Local Development (Minikube) local-up: check-minikube check-kubectl ## Start local development environment (minikube) diff --git a/components/backend/handlers/sessions.go b/components/backend/handlers/sessions.go index b413c9669..591213de8 100644 --- a/components/backend/handlers/sessions.go +++ b/components/backend/handlers/sessions.go @@ -25,13 +25,10 @@ import ( "github.com/gin-gonic/gin" authnv1 "k8s.io/api/authentication/v1" authzv1 "k8s.io/api/authorization/v1" - corev1 "k8s.io/api/core/v1" - rbacv1 "k8s.io/api/rbac/v1" "k8s.io/apimachinery/pkg/api/errors" v1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" "k8s.io/apimachinery/pkg/runtime/schema" - ktypes "k8s.io/apimachinery/pkg/types" "k8s.io/client-go/dynamic" "k8s.io/client-go/kubernetes" ) @@ -45,8 +42,6 @@ var ( // LEGACY: SendMessageToSession removed - AG-UI server uses HTTP/SSE instead of WebSocket ) -const runnerTokenRefreshedAtAnnotation = "ambient-code.io/token-refreshed-at" - // ootbWorkflowsCache provides in-memory caching for OOTB workflows to avoid GitHub API rate limits. // The cache stores workflows by repo URL key and expires after ootbCacheTTL. type ootbWorkflowsCache struct { @@ -719,13 +714,8 @@ func CreateSession(c *gin.Context) { } }() - // Provision runner token using backend SA (requires elevated permissions for SA/Role/Secret creation) - if DynamicClient == nil || K8sClient == nil { - log.Printf("Warning: backend SA clients not available, skipping runner token provisioning for session %s/%s", project, name) - } else if err := provisionRunnerTokenForSession(c, K8sClient, DynamicClient, project, name); err != nil { - // Nonfatal: log and continue. Operator may retry later if implemented. - log.Printf("Warning: failed to provision runner token for session %s/%s: %v", project, name, err) - } + // Runner token provisioning is handled by the operator when creating the pod. + // This ensures consistent behavior whether sessions are created via API or kubectl. c.JSON(http.StatusCreated, gin.H{ "message": "Agentic session created successfully", @@ -734,171 +724,6 @@ func CreateSession(c *gin.Context) { }) } -// provisionRunnerTokenForSession creates a per-session ServiceAccount, grants minimal RBAC, -// mints a short-lived token, stores it in a Secret, and annotates the AgenticSession with the Secret name. 
-func provisionRunnerTokenForSession(c *gin.Context, reqK8s kubernetes.Interface, reqDyn dynamic.Interface, project string, sessionName string) error { - // Load owning AgenticSession to parent all resources - gvr := GetAgenticSessionV1Alpha1Resource() - obj, err := reqDyn.Resource(gvr).Namespace(project).Get(c.Request.Context(), sessionName, v1.GetOptions{}) - if err != nil { - return fmt.Errorf("get AgenticSession: %w", err) - } - ownerRef := v1.OwnerReference{ - APIVersion: obj.GetAPIVersion(), - Kind: obj.GetKind(), - Name: obj.GetName(), - UID: obj.GetUID(), - Controller: types.BoolPtr(true), - } - - // Create ServiceAccount - saName := fmt.Sprintf("ambient-session-%s", sessionName) - sa := &corev1.ServiceAccount{ - ObjectMeta: v1.ObjectMeta{ - Name: saName, - Namespace: project, - Labels: map[string]string{"app": "ambient-runner"}, - OwnerReferences: []v1.OwnerReference{ownerRef}, - }, - } - if _, err := reqK8s.CoreV1().ServiceAccounts(project).Create(c.Request.Context(), sa, v1.CreateOptions{}); err != nil { - if !errors.IsAlreadyExists(err) { - return fmt.Errorf("create SA: %w", err) - } - } - - // Create Role with least-privilege for updating AgenticSession status and annotations - roleName := fmt.Sprintf("ambient-session-%s-role", sessionName) - role := &rbacv1.Role{ - ObjectMeta: v1.ObjectMeta{ - Name: roleName, - Namespace: project, - OwnerReferences: []v1.OwnerReference{ownerRef}, - }, - Rules: []rbacv1.PolicyRule{ - { - APIGroups: []string{"vteam.ambient-code"}, - Resources: []string{"agenticsessions"}, - Verbs: []string{"get", "list", "watch", "update", "patch"}, // Added update, patch for annotations - }, - { - APIGroups: []string{"authorization.k8s.io"}, - Resources: []string{"selfsubjectaccessreviews"}, - Verbs: []string{"create"}, - }, - }, - } - // Try to create or update the Role to ensure it has latest permissions - if _, err := reqK8s.RbacV1().Roles(project).Create(c.Request.Context(), role, v1.CreateOptions{}); err != nil { - if errors.IsAlreadyExists(err) { - // Role exists - update it to ensure it has the latest permissions (including update/patch) - log.Printf("Role %s already exists, updating with latest permissions", roleName) - if _, err := reqK8s.RbacV1().Roles(project).Update(c.Request.Context(), role, v1.UpdateOptions{}); err != nil { - return fmt.Errorf("update Role: %w", err) - } - log.Printf("Successfully updated Role %s with annotation update permissions", roleName) - } else { - return fmt.Errorf("create Role: %w", err) - } - } - - // Bind Role to the ServiceAccount - rbName := fmt.Sprintf("ambient-session-%s-rb", sessionName) - rb := &rbacv1.RoleBinding{ - ObjectMeta: v1.ObjectMeta{ - Name: rbName, - Namespace: project, - OwnerReferences: []v1.OwnerReference{ownerRef}, - }, - RoleRef: rbacv1.RoleRef{APIGroup: "rbac.authorization.k8s.io", Kind: "Role", Name: roleName}, - Subjects: []rbacv1.Subject{{Kind: "ServiceAccount", Name: saName, Namespace: project}}, - } - if _, err := reqK8s.RbacV1().RoleBindings(project).Create(context.TODO(), rb, v1.CreateOptions{}); err != nil { - if !errors.IsAlreadyExists(err) { - return fmt.Errorf("create RoleBinding: %w", err) - } - } - - // Mint short-lived K8s ServiceAccount token for CR status updates - tr := &authnv1.TokenRequest{Spec: authnv1.TokenRequestSpec{}} - tok, err := reqK8s.CoreV1().ServiceAccounts(project).CreateToken(c.Request.Context(), saName, tr, v1.CreateOptions{}) - if err != nil { - return fmt.Errorf("mint token: %w", err) - } - k8sToken := tok.Status.Token - if strings.TrimSpace(k8sToken) == "" { - 
return fmt.Errorf("received empty token for SA %s", saName) - } - - // Only store the K8s token; GitHub tokens are minted on-demand by the runner - secretData := map[string]string{ - "k8s-token": k8sToken, - } - - // Store token in a Secret (update if exists to refresh token) - secretName := fmt.Sprintf("ambient-runner-token-%s", sessionName) - refreshedAt := time.Now().UTC().Format(time.RFC3339) - sec := &corev1.Secret{ - ObjectMeta: v1.ObjectMeta{ - Name: secretName, - Namespace: project, - Labels: map[string]string{"app": "ambient-runner-token"}, - OwnerReferences: []v1.OwnerReference{ownerRef}, - Annotations: map[string]string{ - runnerTokenRefreshedAtAnnotation: refreshedAt, - }, - }, - Type: corev1.SecretTypeOpaque, - StringData: secretData, - } - - // Try to create the secret - if _, err := reqK8s.CoreV1().Secrets(project).Create(c.Request.Context(), sec, v1.CreateOptions{}); err != nil { - if errors.IsAlreadyExists(err) { - // Secret exists - update it with fresh token - log.Printf("Updating existing secret %s with fresh token", secretName) - existing, getErr := reqK8s.CoreV1().Secrets(project).Get(c.Request.Context(), secretName, v1.GetOptions{}) - if getErr != nil { - return fmt.Errorf("get Secret for update: %w", getErr) - } - secretCopy := existing.DeepCopy() - if secretCopy.Data == nil { - secretCopy.Data = map[string][]byte{} - } - secretCopy.Data["k8s-token"] = []byte(k8sToken) - if secretCopy.Annotations == nil { - secretCopy.Annotations = map[string]string{} - } - secretCopy.Annotations[runnerTokenRefreshedAtAnnotation] = refreshedAt - if _, err := reqK8s.CoreV1().Secrets(project).Update(c.Request.Context(), secretCopy, v1.UpdateOptions{}); err != nil { - return fmt.Errorf("update Secret: %w", err) - } - log.Printf("Successfully updated secret %s with fresh token", secretName) - } else { - return fmt.Errorf("create Secret: %w", err) - } - } - - // Annotate the AgenticSession with the Secret and SA names (conflict-safe patch) - patch := map[string]interface{}{ - "metadata": map[string]interface{}{ - "annotations": map[string]string{ - "ambient-code.io/runner-token-secret": secretName, - "ambient-code.io/runner-sa": saName, - }, - }, - } - b, err := json.Marshal(patch) - if err != nil { - return fmt.Errorf("marshal patch: %w", err) - } - if _, err := reqDyn.Resource(gvr).Namespace(project).Patch(c.Request.Context(), obj.GetName(), ktypes.MergePatchType, b, v1.PatchOptions{}); err != nil { - return fmt.Errorf("annotate AgenticSession: %w", err) - } - - return nil -} - func GetSession(c *gin.Context) { project := c.GetString("project") sessionName := c.Param("sessionName") @@ -1574,6 +1399,13 @@ func RemoveRepo(c *gin.Context) { } // GetWorkflowMetadata retrieves commands and agents metadata from the active workflow +// getContentServiceName returns the ambient-content service name for a session +// Temp-content pods are deprecated - sessions must be running to access workspace +func getContentServiceName(session string) string { + return fmt.Sprintf("ambient-content-%s", session) +} + +// GetWorkflowMetadata retrieves the workflow metadata for an agentic session // GET /api/projects/:projectName/agentic-sessions/:sessionName/workflow/metadata func GetWorkflowMetadata(c *gin.Context) { project := c.GetString("project") @@ -1594,21 +1426,8 @@ func GetWorkflowMetadata(c *gin.Context) { token = c.GetHeader("X-Forwarded-Access-Token") } - // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", sessionName) - // Use 
the dependency-injected client selection function - reqK8s, _ := GetK8sClientsForRequest(c) - if reqK8s == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Temp service doesn't exist, use regular service - serviceName = fmt.Sprintf("ambient-content-%s", sessionName) - } else { - serviceName = fmt.Sprintf("ambient-content-%s", sessionName) - } + // Use ambient-content service (per-session content service) + serviceName := fmt.Sprintf("ambient-content-%s", sessionName) // Build URL to content service endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) @@ -2049,18 +1868,10 @@ func StartSession(c *gin.Context) { return } - // Check if this is a continuation (session is in a terminal phase) - isActualContinuation := false + // Log current phase for debugging if currentStatus, ok := item.Object["status"].(map[string]interface{}); ok { if phase, ok := currentStatus["phase"].(string); ok { - terminalPhases := []string{"Completed", "Failed", "Stopped", "Error"} - for _, terminalPhase := range terminalPhases { - if phase == terminalPhase { - isActualContinuation = true - log.Printf("StartSession: Detected continuation - session is in terminal phase: %s", phase) - break - } - } + log.Printf("StartSession: Current phase is %s", phase) } } @@ -2074,10 +1885,16 @@ func StartSession(c *gin.Context) { annotations["ambient-code.io/desired-phase"] = "Running" annotations["ambient-code.io/start-requested-at"] = time.Now().Format(time.RFC3339) - // For continuations, set parent-session-id so operator reuses PVC - if isActualContinuation { - annotations["vteam.ambient-code/parent-session-id"] = sessionName - log.Printf("StartSession: Continuation detected - set parent-session-id=%s for PVC reuse", sessionName) + // Clean up self-referential parent-session-id annotations. + // Old code used to set parent-session-id to the session's own name for PVC reuse, + // but this caused the runner to skip INITIAL_PROMPT thinking it was a continuation. + // With S3 storage, we don't need this anymore. Session state persists via S3 sync. + // Keep legitimate parent-session-id annotations (pointing to a DIFFERENT session). 
+ if existingParent, ok := annotations["vteam.ambient-code/parent-session-id"]; ok { + if existingParent == sessionName { + log.Printf("StartSession: Clearing self-referential parent-session-id annotation") + delete(annotations, "vteam.ambient-code/parent-session-id") + } } item.SetAnnotations(annotations) @@ -2230,109 +2047,25 @@ func StopSession(c *gin.Context) { c.JSON(http.StatusAccepted, session) } -// EnableWorkspaceAccess requests a temporary content pod for workspace access on stopped sessions +// EnableWorkspaceAccess is deprecated - temporary content pods have been removed // POST /api/projects/:projectName/agentic-sessions/:sessionName/workspace/enable func EnableWorkspaceAccess(c *gin.Context) { - project := c.GetString("project") - sessionName := c.Param("sessionName") - gvr := GetAgenticSessionV1Alpha1Resource() - - _, k8sDyn := GetK8sClientsForRequest(c) - if k8sDyn == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - - item, err := k8sDyn.Resource(gvr).Namespace(project).Get(context.TODO(), sessionName, v1.GetOptions{}) - if err != nil { - if errors.IsNotFound(err) { - c.JSON(http.StatusNotFound, gin.H{"error": "Session not found"}) - return - } - c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to get session"}) - return - } - - // Only allow for stopped/completed/failed sessions - status, _ := item.Object["status"].(map[string]interface{}) - phase, _ := status["phase"].(string) - if phase != "Stopped" && phase != "Completed" && phase != "Failed" { - c.JSON(http.StatusConflict, gin.H{"error": "Workspace access only available for stopped sessions"}) - return - } - - // Set annotation to request temp pod - annotations := item.GetAnnotations() - if annotations == nil { - annotations = make(map[string]string) - } - now := time.Now().UTC().Format(time.RFC3339) - annotations["ambient-code.io/temp-content-requested"] = "true" - annotations["ambient-code.io/temp-content-last-accessed"] = now - item.SetAnnotations(annotations) - - // Update CR - updated, err := k8sDyn.Resource(gvr).Namespace(project).Update(context.TODO(), item, v1.UpdateOptions{}) - if err != nil { - c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to enable workspace access"}) - return - } - - session := types.AgenticSession{ - APIVersion: updated.GetAPIVersion(), - Kind: updated.GetKind(), - Metadata: updated.Object["metadata"].(map[string]interface{}), - } - if spec, ok := updated.Object["spec"].(map[string]interface{}); ok { - session.Spec = parseSpec(spec) - } - if status, ok := updated.Object["status"].(map[string]interface{}); ok { - session.Status = parseStatus(status) - } - - log.Printf("EnableWorkspaceAccess: Set temp-content-requested annotation for %s", sessionName) - c.JSON(http.StatusAccepted, session) + c.JSON(http.StatusGone, gin.H{ + "error": "Temporary workspace access has been removed", + "message": "Session artifacts are now stored in S3. 
Access artifacts directly from your S3 bucket.", + "hint": "Configure S3 storage in project settings to persist session state and artifacts.", + "s3Path": fmt.Sprintf("s3://{bucket}/{namespace}/%s/", c.Param("sessionName")), + }) } // TouchWorkspaceAccess updates the last-accessed timestamp to keep temp pod alive // POST /api/projects/:projectName/agentic-sessions/:sessionName/workspace/touch func TouchWorkspaceAccess(c *gin.Context) { - project := c.GetString("project") - sessionName := c.Param("sessionName") - gvr := GetAgenticSessionV1Alpha1Resource() - - _, k8sDyn := GetK8sClientsForRequest(c) - if k8sDyn == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - - item, err := k8sDyn.Resource(gvr).Namespace(project).Get(context.TODO(), sessionName, v1.GetOptions{}) - if err != nil { - if errors.IsNotFound(err) { - c.JSON(http.StatusNotFound, gin.H{"error": "Session not found"}) - return - } - c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to get session"}) - return - } - - annotations := item.GetAnnotations() - if annotations == nil { - annotations = make(map[string]string) - } - annotations["ambient-code.io/temp-content-last-accessed"] = time.Now().UTC().Format(time.RFC3339) - item.SetAnnotations(annotations) - - if _, err := k8sDyn.Resource(gvr).Namespace(project).Update(context.TODO(), item, v1.UpdateOptions{}); err != nil { - c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to update timestamp"}) - return - } - - log.Printf("TouchWorkspaceAccess: Updated last-accessed timestamp for %s", sessionName) - c.JSON(http.StatusOK, gin.H{"message": "Workspace access timestamp updated"}) + // Deprecated: Temp-content pods no longer exist + c.JSON(http.StatusGone, gin.H{ + "error": "Temporary workspace access has been removed", + "message": "Session artifacts are stored in S3 and do not require touch/keepalive.", + }) } // GetSessionK8sResources returns job, pod, and PVC information for a session @@ -2449,64 +2182,12 @@ func GetSessionK8sResources(c *gin.Context) { } } - // Check for temp-content pod - tempPodName := fmt.Sprintf("temp-content-%s", sessionName) - tempPod, err := k8sClt.CoreV1().Pods(project).Get(c.Request.Context(), tempPodName, v1.GetOptions{}) - if err == nil { - tempPodPhase := string(tempPod.Status.Phase) - if tempPod.DeletionTimestamp != nil { - tempPodPhase = "Terminating" - } - - containerInfos := []map[string]interface{}{} - for _, cs := range tempPod.Status.ContainerStatuses { - state := "Unknown" - var exitCode *int32 - var reason string - if cs.State.Running != nil { - state = "Running" - // If pod is terminating but container still shows running, mark as terminating - if tempPod.DeletionTimestamp != nil { - state = "Terminating" - } - } else if cs.State.Terminated != nil { - state = "Terminated" - exitCode = &cs.State.Terminated.ExitCode - reason = cs.State.Terminated.Reason - } else if cs.State.Waiting != nil { - state = "Waiting" - reason = cs.State.Waiting.Reason - } - containerInfos = append(containerInfos, map[string]interface{}{ - "name": cs.Name, - "state": state, - "exitCode": exitCode, - "reason": reason, - }) - } - podInfos = append(podInfos, map[string]interface{}{ - "name": tempPod.Name, - "phase": tempPodPhase, - "containers": containerInfos, - "isTempPod": true, - }) - } - result["pods"] = podInfos - // Get PVC info - always use session's own PVC name - // Note: If session was created with parent_session_id (via API), the operator handles PVC reuse - pvcName := 
fmt.Sprintf("ambient-workspace-%s", sessionName) - pvc, err := k8sClt.CoreV1().PersistentVolumeClaims(project).Get(c.Request.Context(), pvcName, v1.GetOptions{}) - result["pvcName"] = pvcName - if err == nil { - result["pvcExists"] = true - if storage, ok := pvc.Status.Capacity[corev1.ResourceStorage]; ok { - result["pvcSize"] = storage.String() - } - } else { - result["pvcExists"] = false - } + // PVCs deprecated - sessions now use EmptyDir with S3 state persistence + result["pvcExists"] = false + result["pvcName"] = "N/A (using EmptyDir + S3)" + result["storageMode"] = "EmptyDir + S3" c.JSON(http.StatusOK, result) } @@ -2529,10 +2210,11 @@ func ListSessionWorkspace(c *gin.Context) { } rel := strings.TrimSpace(c.Query("path")) - // Build absolute workspace path using plain session (no url.PathEscape to match FS paths) - absPath := "/sessions/" + session + "/workspace" + // Path is relative to content service's StateBaseDir (which is /workspace) + // Content service handles the base path, so we just pass the relative path + absPath := "" if rel != "" { - absPath += "/" + rel + absPath = rel } // Call per-job service or temp service for completed sessions @@ -2541,19 +2223,8 @@ func ListSessionWorkspace(c *gin.Context) { token = c.GetHeader("X-Forwarded-Access-Token") } - // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) - // AuthN: require user token before probing K8s Services - k8sClt, _ := GetK8sClientsForRequest(c) - if k8sClt == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Temp service doesn't exist, use regular service - serviceName = fmt.Sprintf("ambient-content-%s", session) - } + // Use ambient-content service (per-session content service) + serviceName := fmt.Sprintf("ambient-content-%s", session) endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) u := fmt.Sprintf("%s/content/list?path=%s", endpoint, url.QueryEscape(absPath)) @@ -2615,23 +2286,15 @@ func GetSessionWorkspaceFile(c *gin.Context) { } sub := strings.TrimPrefix(c.Param("path"), "/") - absPath := "/sessions/" + session + "/workspace/" + sub + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := sub token := c.GetHeader("Authorization") if strings.TrimSpace(token) == "" { token = c.GetHeader("X-Forwarded-Access-Token") } - // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) - k8sClt, _ := GetK8sClientsForRequest(c) - if k8sClt == nil { - c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) - c.Abort() - return - } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } + // Use ambient-content service (per-session content service) + serviceName := fmt.Sprintf("ambient-content-%s", session) endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) u := fmt.Sprintf("%s/content/file?path=%s", endpoint, url.QueryEscape(absPath)) @@ -2692,22 +2355,22 @@ func PutSessionWorkspaceFile(c *gin.Context) { // Validate and sanitize path to prevent directory traversal // Use robust path validation that works across platforms sub := strings.TrimPrefix(c.Param("path"), "/") - 
workspaceBase := "/sessions/" + session + "/workspace" + workspaceBase := "/workspace" - // Construct absolute path using filepath.Join for proper path handling - absPath := filepath.Join(workspaceBase, sub) + // Construct absolute path using filepath.Join for path validation + validationPath := filepath.Join(workspaceBase, sub) // Use robust path validation from pathutil package // This is more secure than manual string checks and works across platforms - if !pathutil.IsPathWithinBase(absPath, workspaceBase) { - log.Printf("PutSessionWorkspaceFile: path traversal attempt detected - path=%q escapes workspace=%q", absPath, workspaceBase) + if !pathutil.IsPathWithinBase(validationPath, workspaceBase) { + log.Printf("PutSessionWorkspaceFile: path traversal attempt detected - path=%q escapes workspace=%q", validationPath, workspaceBase) c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid path: must be within workspace directory"}) return } + // Use relative path for content service (it has its own StateBaseDir=/workspace) // Convert to forward slashes for content service (expects POSIX paths) - // filepath.Join may use backslashes on Windows, but content service always uses forward slashes - absPath = filepath.ToSlash(absPath) + absPath := filepath.ToSlash(sub) token := c.GetHeader("Authorization") if strings.TrimSpace(token) == "" { @@ -2740,7 +2403,7 @@ func PutSessionWorkspaceFile(c *gin.Context) { // Verify session exists using reqDyn AFTER RBAC check // This prevents enumeration attacks - unauthorized users get same "Forbidden" response gvr := GetAgenticSessionV1Alpha1Resource() - item, err := reqDyn.Resource(gvr).Namespace(project).Get(c.Request.Context(), session, v1.GetOptions{}) + _, err = reqDyn.Resource(gvr).Namespace(project).Get(c.Request.Context(), session, v1.GetOptions{}) if err != nil { if errors.IsNotFound(err) { c.JSON(http.StatusNotFound, gin.H{"error": "Session not found"}) @@ -2750,60 +2413,15 @@ func PutSessionWorkspaceFile(c *gin.Context) { return } - // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) - serviceFound := false - + // Check if ambient-content service exists (session must be running) + serviceName := fmt.Sprintf("ambient-content-%s", session) if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Temp service doesn't exist, try regular service - serviceName = fmt.Sprintf("ambient-content-%s", session) - if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Neither service exists - need to spawn temp content pod - log.Printf("PutSessionWorkspaceFile: No content service found for session %s, requesting temp pod", session) - serviceFound = false - } else { - serviceFound = true - } - } else { - serviceFound = true - } - - // If no service exists, request temp content pod and return accepted status - // We already have the session item from the existence check above - if !serviceFound { - - // Check if temp content was already requested (avoid duplicate pod creation) - annotations := item.GetAnnotations() - if annotations != nil && annotations["ambient-code.io/temp-content-requested"] == "true" { - log.Printf("PutSessionWorkspaceFile: Temp content already requested for session %s", session) - c.JSON(http.StatusAccepted, gin.H{"message": "Content service starting, please retry upload in a few seconds"}) - return - } - - // Request temp content pod via 
annotation - if annotations == nil { - annotations = make(map[string]string) - } - now := time.Now().UTC().Format(time.RFC3339) - annotations["ambient-code.io/temp-content-requested"] = "true" - annotations["ambient-code.io/temp-content-last-accessed"] = now - item.SetAnnotations(annotations) - - // Use optimistic locking - if resource was modified between Get and Update, K8s returns conflict - if _, err := reqDyn.Resource(gvr).Namespace(project).Update(c.Request.Context(), item, v1.UpdateOptions{}); err != nil { - if errors.IsConflict(err) { - // Another request updated the resource - likely also requested temp pod - log.Printf("PutSessionWorkspaceFile: Conflict updating session %s (concurrent request), treating as already requested", session) - c.JSON(http.StatusAccepted, gin.H{"message": "Content service starting, please retry upload in a few seconds"}) - return - } - log.Printf("PutSessionWorkspaceFile: Failed to request temp pod: %v", err) - c.JSON(http.StatusServiceUnavailable, gin.H{"error": "Content service not available, please try again in a few seconds"}) - return - } - - log.Printf("PutSessionWorkspaceFile: Requested temp content pod for session %s", session) - c.JSON(http.StatusAccepted, gin.H{"message": "Content service starting, please retry upload in a few seconds"}) + // Service doesn't exist - session is not running + log.Printf("PutSessionWorkspaceFile: Content service not found for session %s (session not running)", session) + c.JSON(http.StatusConflict, gin.H{ + "error": "Session is not running. Start the session to upload files.", + "hint": "File uploads require an active session. Start the session and try again.", + }) return } @@ -2910,22 +2528,22 @@ func DeleteSessionWorkspaceFile(c *gin.Context) { // Validate and sanitize path to prevent directory traversal // Use robust path validation that works across platforms sub := strings.TrimPrefix(c.Param("path"), "/") - workspaceBase := "/sessions/" + session + "/workspace" + workspaceBase := "/workspace" - // Construct absolute path using filepath.Join for proper path handling - absPath := filepath.Join(workspaceBase, sub) + // Construct absolute path using filepath.Join for path validation + validationPath := filepath.Join(workspaceBase, sub) // Use robust path validation from pathutil package // This is more secure than manual string checks and works across platforms - if !pathutil.IsPathWithinBase(absPath, workspaceBase) { - log.Printf("DeleteSessionWorkspaceFile: path traversal attempt detected - path=%q escapes workspace=%q", absPath, workspaceBase) + if !pathutil.IsPathWithinBase(validationPath, workspaceBase) { + log.Printf("DeleteSessionWorkspaceFile: path traversal attempt detected - path=%q escapes workspace=%q", validationPath, workspaceBase) c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid path: must be within workspace directory"}) return } + // Use relative path for content service (it has its own StateBaseDir=/workspace) // Convert to forward slashes for content service (expects POSIX paths) - // filepath.Join may use backslashes on Windows, but content service always uses forward slashes - absPath = filepath.ToSlash(absPath) + absPath := filepath.ToSlash(sub) token := c.GetHeader("Authorization") if strings.TrimSpace(token) == "" { @@ -2968,26 +2586,11 @@ func DeleteSessionWorkspaceFile(c *gin.Context) { return } - // Try temp service first, then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) - serviceFound := false - + // Check if content service exists (session must be 
running) + serviceName := getContentServiceName(session) if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - // Temp service doesn't exist, try regular service - serviceName = fmt.Sprintf("ambient-content-%s", session) - if _, err := reqK8s.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - log.Printf("DeleteSessionWorkspaceFile: No content service found for session %s", session) - c.JSON(http.StatusServiceUnavailable, gin.H{"error": "Content service not available"}) - return - } else { - serviceFound = true - } - } else { - serviceFound = true - } - - if !serviceFound { - c.JSON(http.StatusServiceUnavailable, gin.H{"error": "Content service not available"}) + log.Printf("DeleteSessionWorkspaceFile: Content service not found for session %s (session not running)", session) + c.JSON(http.StatusConflict, gin.H{"error": "Session is not running. Start the session to access files."}) return } @@ -3060,16 +2663,13 @@ func PushSessionRepo(c *gin.Context) { log.Printf("pushSessionRepo: request project=%s session=%s repoIndex=%d commitLen=%d", project, session, body.RepoIndex, len(strings.TrimSpace(body.CommitMessage))) // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, k8sDyn := GetK8sClientsForRequest(c) if k8sClt == nil || k8sDyn == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) log.Printf("pushSessionRepo: using service %s", serviceName) @@ -3092,11 +2692,12 @@ func PushSessionRepo(c *gin.Context) { } rm, _ := repos[body.RepoIndex].(map[string]interface{}) // Derive repoPath from input URL folder name + // Paths are relative to content service's StateBaseDir (which is /workspace) if in, ok := rm["input"].(map[string]interface{}); ok { if urlv, ok2 := in["url"].(string); ok2 && strings.TrimSpace(urlv) != "" { folder := DeriveRepoFolderFromURL(strings.TrimSpace(urlv)) if folder != "" { - resolvedRepoPath = fmt.Sprintf("/sessions/%s/workspace/%s", session, folder) + resolvedRepoPath = folder } } } @@ -3113,9 +2714,9 @@ func PushSessionRepo(c *gin.Context) { // If input URL missing or unparsable, fall back to numeric index path (last resort) if strings.TrimSpace(resolvedRepoPath) == "" { if body.RepoIndex >= 0 { - resolvedRepoPath = fmt.Sprintf("/sessions/%s/workspace/%d", session, body.RepoIndex) + resolvedRepoPath = fmt.Sprintf("%d", body.RepoIndex) } else { - resolvedRepoPath = fmt.Sprintf("/sessions/%s/workspace", session) + resolvedRepoPath = "" } } if strings.TrimSpace(resolvedOutputURL) == "" { @@ -3229,24 +2830,21 @@ func AbandonSessionRepo(c *gin.Context) { } // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = 
fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) log.Printf("AbandonSessionRepo: using service %s", serviceName) repoPath := strings.TrimSpace(body.RepoPath) if repoPath == "" { if body.RepoIndex >= 0 { - repoPath = fmt.Sprintf("/sessions/%s/workspace/%d", session, body.RepoIndex) + repoPath = fmt.Sprintf("%d", body.RepoIndex) } else { - repoPath = fmt.Sprintf("/sessions/%s/workspace", session) + repoPath = "" } } payload := map[string]interface{}{ @@ -3302,8 +2900,9 @@ func DiffSessionRepo(c *gin.Context) { session := c.Param("sessionName") repoIndexStr := strings.TrimSpace(c.Query("repoIndex")) repoPath := strings.TrimSpace(c.Query("repoPath")) + // Paths are relative to content service's StateBaseDir (which is /workspace) if repoPath == "" && repoIndexStr != "" { - repoPath = fmt.Sprintf("/sessions/%s/workspace/%s", session, repoIndexStr) + repoPath = repoIndexStr } if repoPath == "" { c.JSON(http.StatusBadRequest, gin.H{"error": "missing repoPath/repoIndex"}) @@ -3311,16 +2910,13 @@ func DiffSessionRepo(c *gin.Context) { } // Try temp service first (for completed sessions), then regular service - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080", serviceName, project) log.Printf("DiffSessionRepo: using service %s", serviceName) url := fmt.Sprintf("%s/content/github/diff?repoPath=%s", endpoint, url.QueryEscape(repoPath)) @@ -3372,20 +2968,17 @@ func GetGitStatus(c *gin.Context) { return } - // Build absolute path - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, relativePath) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := relativePath // Get content service endpoint - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-status?path=%s", serviceName, project, url.QueryEscape(absPath)) @@ -3443,20 +3036,17 @@ func ConfigureGitRemote(c *gin.Context) { body.Branch = "main" } - // Build absolute path - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", sessionName, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path // Get content service endpoint - serviceName := fmt.Sprintf("temp-content-%s", sessionName) + serviceName := getContentServiceName(sessionName) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", sessionName) - } endpoint := 
fmt.Sprintf("http://%s.%s.svc:8080/content/git-configure-remote", serviceName, project) @@ -3560,20 +3150,17 @@ func SynchronizeGit(c *gin.Context) { body.Message = fmt.Sprintf("Session %s - %s", session, time.Now().Format(time.RFC3339)) } - // Build absolute path - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path // Get content service endpoint - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-sync", serviceName, project) @@ -3630,18 +3217,16 @@ func GetGitMergeStatus(c *gin.Context) { branch = "main" } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, relativePath) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := relativePath - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-merge-status?path=%s&branch=%s", serviceName, project, url.QueryEscape(absPath), url.QueryEscape(branch)) @@ -3690,18 +3275,16 @@ func GitPullSession(c *gin.Context) { body.Branch = "main" } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-pull", serviceName, project) @@ -3769,18 +3352,16 @@ func GitPushSession(c *gin.Context) { body.Message = fmt.Sprintf("Session %s artifacts", session) } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-push", serviceName, project) @@ -3842,18 +3423,16 @@ func GitCreateBranchSession(c *gin.Context) { 
body.Path = "artifacts" } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, body.Path) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := body.Path - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-create-branch", serviceName, project) @@ -3905,18 +3484,16 @@ func GitListBranchesSession(c *gin.Context) { relativePath = "artifacts" } - absPath := fmt.Sprintf("/sessions/%s/workspace/%s", session, relativePath) + // Path is relative to content service's StateBaseDir (which is /workspace) + absPath := relativePath - serviceName := fmt.Sprintf("temp-content-%s", session) + serviceName := getContentServiceName(session) k8sClt, _ := GetK8sClientsForRequest(c) if k8sClt == nil { c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) c.Abort() return } - if _, err := k8sClt.CoreV1().Services(project).Get(c.Request.Context(), serviceName, v1.GetOptions{}); err != nil { - serviceName = fmt.Sprintf("ambient-content-%s", session) - } endpoint := fmt.Sprintf("http://%s.%s.svc:8080/content/git-list-branches?path=%s", serviceName, project, url.QueryEscape(absPath)) diff --git a/components/backend/server/server.go b/components/backend/server/server.go index a6465e055..5e05caba4 100644 --- a/components/backend/server/server.go +++ b/components/backend/server/server.go @@ -2,11 +2,15 @@ package server import ( + "context" "fmt" "log" "net/http" "os" + "os/signal" "strings" + "syscall" + "time" "github.com/gin-contrib/cors" "github.com/gin-gonic/gin" @@ -95,7 +99,7 @@ func forwardedIdentityMiddleware() gin.HandlerFunc { } } -// RunContentService starts the server in content service mode +// RunContentService starts the server in content service mode with graceful shutdown func RunContentService(registerContentRoutes RouterFunc) error { r := gin.New() r.Use(gin.Recovery()) @@ -124,9 +128,39 @@ func RunContentService(registerContentRoutes RouterFunc) error { if port == "" { port = "8080" } - log.Printf("Content service starting on port %s", port) - if err := r.Run(":" + port); err != nil { - return fmt.Errorf("failed to start content service: %v", err) + + // Create HTTP server for graceful shutdown + srv := &http.Server{ + Addr: ":" + port, + Handler: r, + } + + // Channel to receive shutdown signal + quit := make(chan os.Signal, 1) + signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM) + + // Start server in goroutine + go func() { + log.Printf("Content service starting on port %s", port) + if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed { + log.Fatalf("Content service listen error: %v", err) + } + }() + + // Wait for shutdown signal + sig := <-quit + log.Printf("Content service received signal %v, shutting down gracefully...", sig) + + // Create shutdown context with timeout + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + // Attempt graceful shutdown + if err := srv.Shutdown(ctx); err != nil { + log.Printf("Content service forced to shutdown: %v", err) + return err } + + log.Println("Content service shutdown complete") 
return nil } diff --git a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx index 1776504c9..da3c28e76 100644 --- a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx +++ b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx @@ -90,7 +90,6 @@ import { useSession, useStopSession, useDeleteSession, - useSessionK8sResources, useContinueSession, } from "@/services/queries"; import { @@ -192,10 +191,6 @@ export default function ProjectSessionDetailPage({ error, refetch: refetchSession, } = useSession(projectName, sessionName); - const { data: k8sResources } = useSessionK8sResources( - projectName, - sessionName, - ); const stopMutation = useStopSession(); const deleteMutation = useDeleteSession(); const continueMutation = useContinueSession(); @@ -1256,9 +1251,6 @@ export default function ProjectSessionDetailPage({ ); }; - // Duration calculation removed - startTime/completionTime no longer in status - const durationMs = undefined; - // Loading state if (isLoading || !projectName || !sessionName) { return ( @@ -1383,9 +1375,6 @@ export default function ProjectSessionDetailPage({ onStop={handleStop} onContinue={handleContinue} onDelete={handleDelete} - durationMs={durationMs} - k8sResources={k8sResources} - messageCount={aguiState.messages.length} renderMode="kebab-only" /> diff --git a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/session-header.tsx b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/session-header.tsx index 0f23916cd..985e49d87 100644 --- a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/session-header.tsx +++ b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/session-header.tsx @@ -19,12 +19,6 @@ type SessionHeaderProps = { onStop: () => void; onContinue: () => void; onDelete: () => void; - durationMs?: number; - k8sResources?: { - pvcName?: string; - pvcSize?: string; - }; - messageCount: number; renderMode?: 'full' | 'actions-only' | 'kebab-only'; }; @@ -36,9 +30,6 @@ export function SessionHeader({ onStop, onContinue, onDelete, - durationMs, - k8sResources, - messageCount, renderMode = 'full', }: SessionHeaderProps) { const [detailsModalOpen, setDetailsModalOpen] = useState(false); @@ -146,9 +137,6 @@ export function SessionHeader({ projectName={projectName} open={detailsModalOpen} onOpenChange={setDetailsModalOpen} - durationMs={durationMs} - k8sResources={k8sResources} - messageCount={messageCount} /> 0) { - const remainingMinutes = minutes % 60; - const remainingSeconds = seconds % 60; - return `${hours}h ${remainingMinutes}m ${remainingSeconds}s`; - } else if (minutes > 0) { - const remainingSeconds = seconds % 60; - return `${minutes}m ${remainingSeconds}s`; - } else { - return `${seconds}s`; - } -} - type SessionDetailsModalProps = { session: AgenticSession; projectName: string; open: boolean; onOpenChange: (open: boolean) => void; - durationMs?: number; - k8sResources?: { - pvcName?: string; - pvcSize?: string; - }; - messageCount: number; }; export function SessionDetailsModal({ @@ -45,9 +23,6 @@ export function SessionDetailsModal({ projectName, open, onOpenChange, - durationMs, - k8sResources, - messageCount, }: SessionDetailsModalProps) { const [exportingAgui, setExportingAgui] = useState(false); const [exportingLegacy, setExportingLegacy] = useState(false); @@ -113,44 +88,6 @@ export function SessionDetailsModal({ 
{session.spec.llmSettings.model} -
- Temperature: - {session.spec.llmSettings.temperature} -
- -
- Mode: - {session.spec?.interactive ? "Interactive" : "Headless"} -
- - {/* startTime removed from simplified status */} - -
- Duration: - {typeof durationMs === "number" ? formatDuration(durationMs) : "-"} -
- - {k8sResources?.pvcName && ( -
- PVC: - {k8sResources.pvcName} -
- )} - - {k8sResources?.pvcSize && ( -
- PVC Size: - {k8sResources.pvcSize} -
- )} - - {/* jobName removed from simplified status */} - -
- Messages: - {messageCount} -
- {/* Export buttons */}
{loadingExport ? ( @@ -210,25 +147,42 @@ export function SessionDetailsModal({ {session.status?.conditions && session.status.conditions.length > 0 && (
Reconciliation Conditions
-
+ {session.status.conditions.map((condition, index) => ( -
-
- {condition.type} - - {condition.status} - -
-
{condition.reason || "No reason provided"}
- {condition.message && ( -
{condition.message}
- )} - {condition.lastTransitionTime && ( -
Updated {new Date(condition.lastTransitionTime).toLocaleString()}
- )} -
+ + +
+ {condition.type} + + {condition.status} + +
+
+ +
+
+ Reason: + {condition.reason || "No reason provided"} +
+ {condition.message && ( +
+ Message: +

{condition.message}

+
+ )} + {condition.lastTransitionTime && ( +
+ Updated {new Date(condition.lastTransitionTime).toLocaleString()} +
+ )} +
+
+
))} -
+
)}
diff --git a/components/frontend/src/components/workspace-sections/settings-section.tsx b/components/frontend/src/components/workspace-sections/settings-section.tsx index 607da5ef8..f69ec9e4a 100644 --- a/components/frontend/src/components/workspace-sections/settings-section.tsx +++ b/components/frontend/src/components/workspace-sections/settings-section.tsx @@ -38,11 +38,19 @@ export function SettingsSection({ projectName }: SettingsSectionProps) { const [gitlabToken, setGitlabToken] = useState(""); const [gitlabInstanceUrl, setGitlabInstanceUrl] = useState(""); const [showGitlabToken, setShowGitlabToken] = useState(false); + const [storageMode, setStorageMode] = useState<"shared" | "custom">("shared"); + const [s3Endpoint, setS3Endpoint] = useState(""); + const [s3Bucket, setS3Bucket] = useState(""); + const [s3Region, setS3Region] = useState("us-east-1"); + const [s3AccessKey, setS3AccessKey] = useState(""); + const [s3SecretKey, setS3SecretKey] = useState(""); + const [showS3SecretKey, setShowS3SecretKey] = useState(false); const [anthropicExpanded, setAnthropicExpanded] = useState(false); const [githubExpanded, setGithubExpanded] = useState(false); const [jiraExpanded, setJiraExpanded] = useState(false); const [gitlabExpanded, setGitlabExpanded] = useState(false); - const FIXED_KEYS = useMemo(() => ["ANTHROPIC_API_KEY","GIT_USER_NAME","GIT_USER_EMAIL","GITHUB_TOKEN","JIRA_URL","JIRA_PROJECT","JIRA_EMAIL","JIRA_API_TOKEN","GITLAB_TOKEN","GITLAB_INSTANCE_URL"] as const, []); + const [s3Expanded, setS3Expanded] = useState(false); + const FIXED_KEYS = useMemo(() => ["ANTHROPIC_API_KEY","GIT_USER_NAME","GIT_USER_EMAIL","GITHUB_TOKEN","JIRA_URL","JIRA_PROJECT","JIRA_EMAIL","JIRA_API_TOKEN","GITLAB_TOKEN","GITLAB_INSTANCE_URL","STORAGE_MODE","S3_ENDPOINT","S3_BUCKET","S3_REGION","S3_ACCESS_KEY","S3_SECRET_KEY"] as const, []); // React Query hooks const { data: project, isLoading: projectLoading } = useProject(projectName); @@ -75,6 +83,14 @@ export function SettingsSection({ projectName }: SettingsSectionProps) { setJiraToken(byKey["JIRA_API_TOKEN"] || ""); setGitlabToken(byKey["GITLAB_TOKEN"] || ""); setGitlabInstanceUrl(byKey["GITLAB_INSTANCE_URL"] || ""); + // Determine storage mode: "custom" if S3_ENDPOINT is set, otherwise "shared" (default) + const hasCustomS3 = byKey["STORAGE_MODE"] === "custom" || (byKey["S3_ENDPOINT"] && byKey["S3_ENDPOINT"] !== ""); + setStorageMode(hasCustomS3 ? 
"custom" : "shared"); + setS3Endpoint(byKey["S3_ENDPOINT"] || ""); + setS3Bucket(byKey["S3_BUCKET"] || ""); + setS3Region(byKey["S3_REGION"] || "us-east-1"); + setS3AccessKey(byKey["S3_ACCESS_KEY"] || ""); + setS3SecretKey(byKey["S3_SECRET_KEY"] || ""); setSecrets(allSecrets.filter(s => !FIXED_KEYS.includes(s.key as typeof FIXED_KEYS[number]))); } }, [runnerSecrets, integrationSecrets, FIXED_KEYS]); @@ -147,6 +163,18 @@ export function SettingsSection({ projectName }: SettingsSectionProps) { if (jiraToken) integrationData["JIRA_API_TOKEN"] = jiraToken; if (gitlabToken) integrationData["GITLAB_TOKEN"] = gitlabToken; if (gitlabInstanceUrl) integrationData["GITLAB_INSTANCE_URL"] = gitlabInstanceUrl; + + // S3 Storage configuration + integrationData["STORAGE_MODE"] = storageMode; + if (storageMode === "custom") { + // Only save custom S3 settings when custom mode is selected + if (s3Endpoint) integrationData["S3_ENDPOINT"] = s3Endpoint; + if (s3Bucket) integrationData["S3_BUCKET"] = s3Bucket; + if (s3Region) integrationData["S3_REGION"] = s3Region; + if (s3AccessKey) integrationData["S3_ACCESS_KEY"] = s3AccessKey; + if (s3SecretKey) integrationData["S3_SECRET_KEY"] = s3SecretKey; + } + // If shared mode: backend will use operator defaults + minio-credentials secret for (const { key, value } of secrets) { if (!key) continue; if (FIXED_KEYS.includes(key as typeof FIXED_KEYS[number])) continue; @@ -468,6 +496,137 @@ export function SettingsSection({ projectName }: SettingsSectionProps) { )} + {/* S3 Storage Configuration Section */} +
+
setS3Expanded((v) => !v)} + > +
+ +
Configure S3-compatible storage for session artifacts and state
+
+ {s3Expanded ? : } +
+ {s3Expanded && ( +
+ + + Session State Storage + + Session artifacts, uploads, and Claude history are persisted to S3-compatible storage. By default, the cluster provides shared MinIO storage. + + +
+ +
+
+ setStorageMode("shared")} + className="h-4 w-4" + /> + +
+
+ Automatically uses in-cluster MinIO. No configuration needed. +
+
+
+
+ setStorageMode("custom")} + className="h-4 w-4" + /> + +
+
+ Configure AWS S3, external MinIO, or other S3-compatible endpoint. +
+
+
+ {storageMode === "custom" && ( + <> +
+ +
S3-compatible endpoint (e.g., https://s3.amazonaws.com, http://minio.local:9000)
+ setS3Endpoint(e.target.value)} + /> +
+
+ +
Bucket name for session storage
+ setS3Bucket(e.target.value)} + /> +
+
+ +
AWS region (optional, default: us-east-1)
+ setS3Region(e.target.value)} + /> +
+
+ +
S3 access key ID
+ setS3AccessKey(e.target.value)} + /> +
+
+ +
S3 secret access key
+
+ setS3SecretKey(e.target.value)} + className="flex-1" + /> + +
+
+ + )} +
+ )} +
+ {/* Custom Environment Variables Section */}
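Editor's note: the STORAGE_MODE / S3_* keys saved above are consumed on the operator side by getS3ConfigForProject (see the operator sessions.go hunks later in this patch). Below is a minimal sketch of the assumed resolution order, with shared-mode defaults coming from the operator's S3_ENDPOINT/S3_BUCKET env and the minio-credentials Secret; the names, types, and signature here are illustrative assumptions, not the verified implementation.

package main

import "fmt"

// S3Defaults loosely mirrors the operator-level defaults (S3_ENDPOINT / S3_BUCKET env plus
// the minio-credentials Secret); the field names are assumptions for this sketch.
type S3Defaults struct {
	Endpoint, Bucket, AccessKey, SecretKey string
}

// resolveS3Config sketches the assumed behaviour of getS3ConfigForProject: the project's
// integration Secret wins when STORAGE_MODE=custom and an endpoint is set, otherwise fall
// back to the shared in-cluster MinIO defaults; error when neither is available.
func resolveS3Config(projectSecret map[string]string, shared S3Defaults) (endpoint, bucket, accessKey, secretKey string, err error) {
	if projectSecret["STORAGE_MODE"] == "custom" && projectSecret["S3_ENDPOINT"] != "" {
		return projectSecret["S3_ENDPOINT"], projectSecret["S3_BUCKET"],
			projectSecret["S3_ACCESS_KEY"], projectSecret["S3_SECRET_KEY"], nil
	}
	if shared.Endpoint == "" || shared.Bucket == "" {
		return "", "", "", "", fmt.Errorf("no S3 storage configured for project")
	}
	return shared.Endpoint, shared.Bucket, shared.AccessKey, shared.SecretKey, nil
}

func main() {
	// Shared mode: no per-project overrides, so the operator defaults win.
	ep, bucket, _, _, err := resolveS3Config(map[string]string{}, S3Defaults{
		Endpoint: "http://minio.ambient-code.svc:9000",
		Bucket:   "ambient-sessions",
	})
	fmt.Println(ep, bucket, err)
}

In custom mode the same call would instead return the five keys written by this settings panel.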
diff --git a/components/frontend/src/types/project-settings.ts b/components/frontend/src/types/project-settings.ts index ccb9ebd0f..d0aff5cd9 100644 --- a/components/frontend/src/types/project-settings.ts +++ b/components/frontend/src/types/project-settings.ts @@ -4,11 +4,19 @@ export type LLMSettings = { maxTokens: number; }; +export type S3StorageConfig = { + enabled: boolean; + endpoint: string; + bucket: string; + region?: string; +}; + export type ProjectDefaultSettings = { llmSettings: LLMSettings; defaultTimeout: number; allowedWebsiteDomains?: string[]; maxConcurrentSessions: number; + s3Storage?: S3StorageConfig; }; export type ProjectResourceLimits = { diff --git a/components/manifests/base/kustomization.yaml b/components/manifests/base/kustomization.yaml index 58c3c658b..e35dc92cb 100644 --- a/components/manifests/base/kustomization.yaml +++ b/components/manifests/base/kustomization.yaml @@ -13,6 +13,7 @@ resources: - frontend-deployment.yaml - operator-deployment.yaml - workspace-pvc.yaml +- minio-deployment.yaml # Default images (can be overridden by overlays) images: @@ -24,4 +25,6 @@ images: newTag: latest - name: quay.io/ambient_code/vteam_claude_runner newTag: latest +- name: quay.io/ambient_code/vteam_state_sync + newTag: latest diff --git a/components/manifests/base/minio-credentials-secret.yaml.example b/components/manifests/base/minio-credentials-secret.yaml.example new file mode 100644 index 000000000..58472d078 --- /dev/null +++ b/components/manifests/base/minio-credentials-secret.yaml.example @@ -0,0 +1,31 @@ +apiVersion: v1 +kind: Secret +metadata: + name: minio-credentials +type: Opaque +stringData: + # MinIO root credentials + # Change these values in production! + root-user: "admin" + root-password: "changeme123" + + # For use in project settings (same credentials for convenience) + access-key: "admin" + secret-key: "changeme123" +--- +# Instructions: +# 1. Copy this file to minio-credentials-secret.yaml +# 2. Change root-user and root-password to secure values +# 3. Apply: kubectl apply -f minio-credentials-secret.yaml -n ambient-code +# +# After MinIO is running: +# 1. Access MinIO console: kubectl port-forward svc/minio 9001:9001 -n ambient-code +# 2. Open http://localhost:9001 in browser +# 3. Login with root-user/root-password +# 4. Create bucket: "ambient-sessions" +# 5. 
Configure bucket in project settings: +# - S3_ENDPOINT: http://minio.ambient-code.svc:9000 +# - S3_BUCKET: ambient-sessions +# - S3_ACCESS_KEY: {your-root-user} +# - S3_SECRET_KEY: {your-root-password} + diff --git a/components/manifests/base/minio-deployment.yaml b/components/manifests/base/minio-deployment.yaml new file mode 100644 index 000000000..f537d4d74 --- /dev/null +++ b/components/manifests/base/minio-deployment.yaml @@ -0,0 +1,102 @@ +apiVersion: v1 +kind: PersistentVolumeClaim +metadata: + name: minio-data + labels: + app: minio +spec: + accessModes: + - ReadWriteOnce + resources: + requests: + storage: 50Gi +--- +apiVersion: apps/v1 +kind: Deployment +metadata: + name: minio + labels: + app: minio +spec: + replicas: 1 + selector: + matchLabels: + app: minio + template: + metadata: + labels: + app: minio + spec: + containers: + - name: minio + image: quay.io/minio/minio:latest + args: + - server + - /data + - --console-address + - ":9001" + env: + - name: MINIO_ROOT_USER + valueFrom: + secretKeyRef: + name: minio-credentials + key: root-user + - name: MINIO_ROOT_PASSWORD + valueFrom: + secretKeyRef: + name: minio-credentials + key: root-password + ports: + - containerPort: 9000 + name: api + protocol: TCP + - containerPort: 9001 + name: console + protocol: TCP + volumeMounts: + - name: data + mountPath: /data + livenessProbe: + httpGet: + path: /minio/health/live + port: 9000 + initialDelaySeconds: 30 + periodSeconds: 10 + readinessProbe: + httpGet: + path: /minio/health/ready + port: 9000 + initialDelaySeconds: 10 + periodSeconds: 5 + resources: + requests: + cpu: 250m + memory: 512Mi + limits: + cpu: 1000m + memory: 2Gi + volumes: + - name: data + persistentVolumeClaim: + claimName: minio-data +--- +apiVersion: v1 +kind: Service +metadata: + name: minio + labels: + app: minio +spec: + type: ClusterIP + ports: + - port: 9000 + targetPort: 9000 + protocol: TCP + name: api + - port: 9001 + targetPort: 9001 + protocol: TCP + name: console + selector: + app: minio + diff --git a/components/manifests/base/operator-deployment.yaml b/components/manifests/base/operator-deployment.yaml index fe8d38056..fe6a7b08e 100644 --- a/components/manifests/base/operator-deployment.yaml +++ b/components/manifests/base/operator-deployment.yaml @@ -19,7 +19,21 @@ spec: - name: agentic-operator image: quay.io/ambient_code/vteam_operator:latest imagePullPolicy: Always + args: + # Controller-runtime configuration + - --max-concurrent-reconciles=10 # Process up to 10 sessions in parallel + - --health-probe-bind-address=:8081 + - --leader-elect=false # Enable for HA deployments with replicas > 1 + # Uncomment for debugging with legacy watch-based implementation: + # - --legacy-watch + ports: + - containerPort: 8081 + name: health + protocol: TCP env: + # Controller concurrency (can be overridden via args) + - name: MAX_CONCURRENT_RECONCILES + value: "10" - name: NAMESPACE valueFrom: fieldRef: @@ -35,7 +49,7 @@ spec: - name: CONTENT_SERVICE_IMAGE value: "quay.io/ambient_code/vteam_backend:latest" - name: IMAGE_PULL_POLICY - value: "Always" + value: "IfNotPresent" # Vertex AI configuration from ConfigMap - name: CLAUDE_CODE_USE_VERTEX valueFrom: @@ -96,6 +110,20 @@ spec: name: google-workflow-app-secret key: GOOGLE_OAUTH_CLIENT_SECRET optional: true + # S3 state sync configuration (defaults - can be overridden per-project in settings) + - name: STATE_SYNC_IMAGE + value: "quay.io/ambient_code/vteam_state_sync:latest" + - name: S3_ENDPOINT + value: "http://minio.ambient-code.svc:9000" # In-cluster MinIO 
(change for external S3) + - name: S3_BUCKET + value: "ambient-sessions" # Create this bucket in MinIO console + # OpenTelemetry configuration + - name: OTEL_EXPORTER_OTLP_ENDPOINT + value: "otel-collector.ambient-code.svc:4317" # Deploy OTel collector separately + - name: DEPLOYMENT_ENV + value: "production" + - name: VERSION + value: "latest" # Override with actual version in production resources: requests: cpu: 50m @@ -104,11 +132,15 @@ spec: cpu: 200m memory: 256Mi livenessProbe: - exec: - command: - - /bin/sh - - -c - - "ps aux | grep '[o]perator' || exit 1" - initialDelaySeconds: 30 + httpGet: + path: /healthz + port: health + initialDelaySeconds: 15 + periodSeconds: 20 + readinessProbe: + httpGet: + path: /readyz + port: health + initialDelaySeconds: 5 periodSeconds: 10 restartPolicy: Always diff --git a/components/manifests/base/rbac/operator-clusterrole.yaml b/components/manifests/base/rbac/operator-clusterrole.yaml index e5a6b97ae..6d19ba779 100644 --- a/components/manifests/base/rbac/operator-clusterrole.yaml +++ b/components/manifests/base/rbac/operator-clusterrole.yaml @@ -25,10 +25,10 @@ rules: - apiGroups: ["batch"] resources: ["jobs"] verbs: ["get", "list", "watch", "create", "delete"] -# Pods (for getting logs from failed jobs and cleanup on stop) +# Pods (create runner pods directly, get logs, and cleanup on stop) - apiGroups: [""] resources: ["pods"] - verbs: ["get", "list", "watch", "delete", "deletecollection"] + verbs: ["get", "list", "watch", "create", "delete", "deletecollection"] - apiGroups: [""] resources: ["pods/log"] verbs: ["get"] diff --git a/components/manifests/deploy.sh b/components/manifests/deploy.sh index ba0a3ba90..c3f33eb3a 100755 --- a/components/manifests/deploy.sh +++ b/components/manifests/deploy.sh @@ -133,6 +133,7 @@ DEFAULT_BACKEND_IMAGE="${DEFAULT_BACKEND_IMAGE:-${CONTAINER_REGISTRY}/vteam_back DEFAULT_FRONTEND_IMAGE="${DEFAULT_FRONTEND_IMAGE:-${CONTAINER_REGISTRY}/vteam_frontend:${IMAGE_TAG}}" DEFAULT_OPERATOR_IMAGE="${DEFAULT_OPERATOR_IMAGE:-${CONTAINER_REGISTRY}/vteam_operator:${IMAGE_TAG}}" DEFAULT_RUNNER_IMAGE="${DEFAULT_RUNNER_IMAGE:-${CONTAINER_REGISTRY}/vteam_claude_runner:${IMAGE_TAG}}" +DEFAULT_STATE_SYNC_IMAGE="${DEFAULT_STATE_SYNC_IMAGE:-${CONTAINER_REGISTRY}/vteam_state_sync:${IMAGE_TAG}}" # Content service image (defaults to same as backend, but can be overridden) CONTENT_SERVICE_IMAGE="${CONTENT_SERVICE_IMAGE:-${DEFAULT_BACKEND_IMAGE}}" @@ -233,6 +234,7 @@ echo -e "Backend Image: ${GREEN}${DEFAULT_BACKEND_IMAGE}${NC}" echo -e "Frontend Image: ${GREEN}${DEFAULT_FRONTEND_IMAGE}${NC}" echo -e "Operator Image: ${GREEN}${DEFAULT_OPERATOR_IMAGE}${NC}" echo -e "Runner Image: ${GREEN}${DEFAULT_RUNNER_IMAGE}${NC}" +echo -e "State Sync Image: ${GREEN}${DEFAULT_STATE_SYNC_IMAGE}${NC}" echo -e "Content Service Image: ${GREEN}${CONTENT_SERVICE_IMAGE}${NC}" echo "" @@ -305,6 +307,7 @@ kustomize edit set image quay.io/ambient_code/vteam_backend:latest=${DEFAULT_BAC kustomize edit set image quay.io/ambient_code/vteam_frontend:latest=${DEFAULT_FRONTEND_IMAGE} kustomize edit set image quay.io/ambient_code/vteam_operator:latest=${DEFAULT_OPERATOR_IMAGE} kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=${DEFAULT_RUNNER_IMAGE} +kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=${DEFAULT_STATE_SYNC_IMAGE} # Build and apply manifests echo -e "${BLUE}Building and applying manifests...${NC}" @@ -428,6 +431,7 @@ kustomize edit set image quay.io/ambient_code/vteam_backend:latest=quay.io/ambie kustomize edit set 
image quay.io/ambient_code/vteam_frontend:latest=quay.io/ambient_code/vteam_frontend:latest kustomize edit set image quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:latest kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:latest +kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:latest cd ../.. echo -e "${GREEN}🎯 Ready to create RFE workflows with multi-agent collaboration!${NC}" diff --git a/components/manifests/observability/README.md b/components/manifests/observability/README.md new file mode 100644 index 000000000..1513a8eb2 --- /dev/null +++ b/components/manifests/observability/README.md @@ -0,0 +1,191 @@ +# Observability Stack for Ambient Code Platform + +Observability for OpenShift using **User Workload Monitoring** (no dedicated Prometheus needed). + +## Architecture + +``` +Operator (OTel SDK) → OTel Collector → OpenShift Prometheus + ↓ + OpenShift Console + ↓ + Grafana (optional) +``` + +## Quick Start + +### Deploy Base Stack + +```bash +# From repository root +make deploy-observability + +# Or manually +kubectl apply -k components/manifests/observability/ +``` + +**What you get**: OTel Collector + ServiceMonitor (128MB) + +### View Metrics + +Open **OpenShift Console → Observe → Metrics** and query: +- `ambient_sessions_total` +- `ambient_session_startup_duration_bucket` +- `ambient_session_errors` + +--- + +## Optional: Add Grafana + +If you want custom dashboards: + +```bash +# Add Grafana overlay +kubectl apply -k components/manifests/observability/overlays/with-grafana/ +``` + +**Adds**: Grafana (additional 128MB) - still uses OpenShift Prometheus + +**Access Grafana**: +```bash +# Create route +oc create route edge grafana --service=grafana -n ambient-code + +# Get URL +oc get route grafana -n ambient-code -o jsonpath='{.spec.host}' +# Login: admin/admin +``` + +**Import dashboard**: Upload `dashboards/ambient-operator-dashboard.json` in Grafana UI + +--- + +## Components + +| Component | What It Does | Resource Usage | +|-----------|--------------|----------------| +| **OTel Collector** | Receives metrics from operator, exports to Prometheus format | 128MB RAM | +| **ServiceMonitor** | Tells OpenShift Prometheus to scrape OTel Collector | None | +| **Grafana** (optional) | Custom dashboards | 128MB RAM, 5GB storage | + +## Metrics Available + +All metrics are prefixed with `ambient_`: + +| Metric | Type | Description | Alert Threshold | +|--------|------|-------------|-----------------| +| `ambient_session_startup_duration` | Histogram | Time from creation to Running phase | p95 > 60s | +| `ambient_session_phase_transitions` | Counter | Phase transition events | - | +| `ambient_sessions_total` | Counter | Total sessions created | Sudden spikes | +| `ambient_sessions_completed` | Counter | Sessions that reached terminal states | - | +| `ambient_reconcile_duration` | Histogram | Reconciliation loop performance | p95 > 10s | +| `ambient_pod_creation_duration` | Histogram | Time to create runner pods | p95 > 30s | +| `ambient_token_provision_duration` | Histogram | Token provisioning time | p95 > 5s | +| `ambient_session_errors` | Counter | Errors during reconciliation | Rate > 0.1/s | + +## Accessing Components + +### OpenShift Console (Options 1 & 2) + +Navigate to **Observe → Metrics** and query: + +```promql +# Total sessions created +ambient_sessions_total + +# Session creation rate +rate(ambient_sessions_total[5m]) + 
+# p95 startup time +histogram_quantile(0.95, rate(ambient_session_startup_duration_bucket[5m])) + +# Error rate by namespace +sum by (namespace) (rate(ambient_session_errors[5m])) +``` + +### OTel Collector Logs + +```bash +kubectl logs -n ambient-code -l app=otel-collector -f +``` + +## Production Setup + +### Enable OpenShift User Workload Monitoring + +Check if enabled: +```bash +oc -n openshift-user-workload-monitoring get pod +``` + +If not: +```bash +oc apply -f - < 0 || (job.Status.Succeeded == 0 && job.Status.Failed == 0) { - log.Printf("Job %s is still active, cleaning up job and pods", jobName) - - // First, delete the job itself with foreground propagation - deletePolicy := v1.DeletePropagationForeground - err = config.K8sClient.BatchV1().Jobs(sessionNamespace).Delete(context.TODO(), jobName, v1.DeleteOptions{ - PropagationPolicy: &deletePolicy, - }) - if err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete job %s: %v", jobName, err) - } else { - log.Printf("Successfully deleted job %s for stopped session", jobName) - } + // Pod exists, delete it + log.Printf("Pod %s is still active, cleaning up pod", podName) - // Then, explicitly delete all pods for this job (by job-name label) - podSelector := fmt.Sprintf("job-name=%s", jobName) - log.Printf("Deleting pods with job-name selector: %s", podSelector) - err = config.K8sClient.CoreV1().Pods(sessionNamespace).DeleteCollection(context.TODO(), v1.DeleteOptions{}, v1.ListOptions{ - LabelSelector: podSelector, - }) - if err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete pods for job %s: %v (continuing anyway)", jobName, err) - } else { - log.Printf("Successfully deleted pods for job %s", jobName) - } + // Delete the pod + deletePolicy := v1.DeletePropagationForeground + err = config.K8sClient.CoreV1().Pods(sessionNamespace).Delete(context.TODO(), podName, v1.DeleteOptions{ + PropagationPolicy: &deletePolicy, + }) + if err != nil && !errors.IsNotFound(err) { + log.Printf("Failed to delete pod %s: %v", podName, err) + } else { + log.Printf("Successfully deleted pod %s for stopped session", podName) + } - // Also delete any pods labeled with this session (in case owner refs are lost) - sessionPodSelector := fmt.Sprintf("agentic-session=%s", name) - log.Printf("Deleting pods with agentic-session selector: %s", sessionPodSelector) - err = config.K8sClient.CoreV1().Pods(sessionNamespace).DeleteCollection(context.TODO(), v1.DeleteOptions{}, v1.ListOptions{ - LabelSelector: sessionPodSelector, - }) - if err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete session-labeled pods: %v (continuing anyway)", err) - } else { - log.Printf("Successfully deleted session-labeled pods") - } + // Also delete any other pods labeled with this session (in case owner refs are lost) + sessionPodSelector := fmt.Sprintf("agentic-session=%s", name) + log.Printf("Deleting pods with agentic-session selector: %s", sessionPodSelector) + err = config.K8sClient.CoreV1().Pods(sessionNamespace).DeleteCollection(context.TODO(), v1.DeleteOptions{}, v1.ListOptions{ + LabelSelector: sessionPodSelector, + }) + if err != nil && !errors.IsNotFound(err) { + log.Printf("Failed to delete session-labeled pods: %v (continuing anyway)", err) } else { - log.Printf("Job %s already completed (Succeeded: %d, Failed: %d), no cleanup needed", jobName, job.Status.Succeeded, job.Status.Failed) + log.Printf("Successfully deleted session-labeled pods") } } else if !errors.IsNotFound(err) { - log.Printf("Error checking job %s: %v", jobName, 
err) + log.Printf("Error checking pod %s: %v", podName, err) } else { - log.Printf("Job %s not found, already cleaned up", jobName) + log.Printf("Pod %s not found, already cleaned up", podName) } // Also cleanup ambient-vertex secret when session is stopped @@ -508,25 +422,25 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { // If in Creating phase, check if job exists if phase == "Creating" { - jobName := fmt.Sprintf("%s-job", name) - _, err := config.K8sClient.BatchV1().Jobs(sessionNamespace).Get(context.TODO(), jobName, v1.GetOptions{}) + podName := fmt.Sprintf("%s-runner", name) + _, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), podName, v1.GetOptions{}) if err == nil { - // Job exists, start monitoring if not already running - monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, jobName) - monitoredJobsMu.Lock() - alreadyMonitoring := monitoredJobs[monitorKey] + // Pod exists, start monitoring if not already running + monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, podName) + monitoredPodsMu.Lock() + alreadyMonitoring := monitoredPods[monitorKey] if !alreadyMonitoring { - monitoredJobs[monitorKey] = true - monitoredJobsMu.Unlock() - log.Printf("Resuming monitoring for existing job %s (session in Creating phase)", jobName) - go monitorJob(jobName, name, sessionNamespace) + monitoredPods[monitorKey] = true + monitoredPodsMu.Unlock() + log.Printf("Resuming monitoring for existing pod %s (session in Creating phase)", podName) + go monitorPod(podName, name, sessionNamespace) } else { - monitoredJobsMu.Unlock() - log.Printf("Job %s already being monitored, skipping duplicate", jobName) + monitoredPodsMu.Unlock() + log.Printf("Pod %s already being monitored, skipping duplicate", podName) } return nil } else if errors.IsNotFound(err) { - // Job doesn't exist but phase is Creating - check if this is due to a stop request + // Pod doesn't exist but phase is Creating - check if this is due to a stop request if desiredPhase == "Stopped" { // Job already gone, can transition directly to Stopped (skip Stopping phase) log.Printf("Session %s in Creating phase but job not found and stop requested, transitioning to Stopped", name) @@ -537,14 +451,14 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { Type: conditionReady, Status: "False", Reason: "UserStopped", - Message: "User requested stop during job creation", + Message: "User requested stop during pod creation", }) // Update progress-tracking conditions statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "False", Reason: "UserStopped", - Message: "Job deleted by user stop request", + Message: "Pod deleted by user stop request", }) statusPatch.AddCondition(conditionUpdate{ Type: conditionRunnerStarted, @@ -558,11 +472,11 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { return nil } - // Job doesn't exist but phase is Creating - this is inconsistent state + // Pod doesn't exist but phase is Creating - this is inconsistent state // Could happen if: - // 1. Job was manually deleted - // 2. Operator crashed between job creation and status update - // 3. Session is being stopped and job was deleted (stale event) + // 1. Pod was manually deleted + // 2. Operator crashed between pod creation and status update + // 3. 
Session is being stopped and pod was deleted (stale event) // Before recreating, verify the session hasn't been stopped // Fetch fresh status to check for recent state changes @@ -579,26 +493,26 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { freshStatus, _, _ := unstructured.NestedMap(freshObj.Object, "status") freshPhase, _, _ := unstructured.NestedString(freshStatus, "phase") if freshPhase == "Stopped" || freshPhase == "Stopping" || freshPhase == "Failed" || freshPhase == "Completed" { - log.Printf("Session %s is now in %s phase (stale Creating event), skipping job recreation", name, freshPhase) + log.Printf("Session %s is now in %s phase (stale Creating event), skipping pod recreation", name, freshPhase) return nil } } - log.Printf("Session %s in Creating phase but job not found, resetting to Pending and recreating", name) + log.Printf("Session %s in Creating phase but pod not found, resetting to Pending and recreating", name) statusPatch.SetField("phase", "Pending") statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "False", - Reason: "JobMissing", - Message: "Job not found, will recreate", + Reason: "PodMissing", + Message: "Pod not found, will recreate", }) // Apply immediately and continue to Pending logic _ = statusPatch.ApplyAndReset() - // Don't return - fall through to Pending logic to create job + // Don't return - fall through to Pending logic to create pod _ = "Pending" // phase reset handled by status update } else { - // Error checking job - log and continue - log.Printf("Error checking job for Creating session %s: %v, will attempt recovery", name, err) + // Error checking pod - log and continue + log.Printf("Error checking pod for Creating session %s: %v, will attempt recovery", name, err) // Fall through to Pending logic _ = "Pending" // phase reset handled by status update } @@ -620,90 +534,8 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { } } - // Determine PVC name and owner references - var pvcName string - var ownerRefs []v1.OwnerReference - reusingPVC := false - - if parentSessionID != "" { - // Continuation: reuse parent's PVC - pvcName = fmt.Sprintf("ambient-workspace-%s", parentSessionID) - reusingPVC = true - log.Printf("Session continuation: reusing PVC %s from parent session %s", pvcName, parentSessionID) - // No owner refs - we don't own the parent's PVC - } else { - // New session: create fresh PVC with owner refs - pvcName = fmt.Sprintf("ambient-workspace-%s", name) - ownerRefs = []v1.OwnerReference{ - { - APIVersion: "vteam.ambient-code/v1", - Kind: "AgenticSession", - Name: currentObj.GetName(), - UID: currentObj.GetUID(), - Controller: boolPtr(true), - // BlockOwnerDeletion intentionally omitted to avoid permission issues - }, - } - } - - // Ensure PVC exists (skip for continuation if parent's PVC should exist) - if !reusingPVC { - if err := services.EnsureSessionWorkspacePVC(sessionNamespace, pvcName, ownerRefs); err != nil { - log.Printf("Failed to ensure session PVC %s in %s: %v", pvcName, sessionNamespace, err) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "False", - Reason: "ProvisioningFailed", - Message: err.Error(), - }) - } else { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "True", - Reason: "Bound", - Message: fmt.Sprintf("PVC %s ready", pvcName), - }) - } - } else { - // Verify parent's PVC exists - if _, err := 
config.K8sClient.CoreV1().PersistentVolumeClaims(sessionNamespace).Get(context.TODO(), pvcName, v1.GetOptions{}); err != nil { - log.Printf("Warning: Parent PVC %s not found for continuation session %s: %v", pvcName, name, err) - // Fall back to creating new PVC with current session's owner refs - pvcName = fmt.Sprintf("ambient-workspace-%s", name) - ownerRefs = []v1.OwnerReference{ - { - APIVersion: "vteam.ambient-code/v1", - Kind: "AgenticSession", - Name: currentObj.GetName(), - UID: currentObj.GetUID(), - Controller: boolPtr(true), - }, - } - if err := services.EnsureSessionWorkspacePVC(sessionNamespace, pvcName, ownerRefs); err != nil { - log.Printf("Failed to create fallback PVC %s: %v", pvcName, err) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "False", - Reason: "ProvisioningFailed", - Message: err.Error(), - }) - } else { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "True", - Reason: "Bound", - Message: fmt.Sprintf("PVC %s ready", pvcName), - }) - } - } else { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionPVCReady, - Status: "True", - Reason: "Reused", - Message: fmt.Sprintf("Reused PVC %s from parent session", pvcName), - }) - } - } + // EmptyDir replaces PVC - session state persists in S3 + log.Printf("Session will use EmptyDir with S3 state persistence") // Load config for this session appConfig := config.LoadConfig() @@ -795,61 +627,49 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { log.Printf("Langfuse disabled, skipping secret copy") } - // CRITICAL: Delete temp content pod before creating Job to avoid PVC mount conflict - // The PVC is ReadWriteOnce, so only one pod can mount it at a time - tempPodName = fmt.Sprintf("temp-content-%s", name) - if _, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), tempPodName, v1.GetOptions{}); err == nil { - log.Printf("[PVCConflict] Deleting temp pod %s before creating Job (ReadWriteOnce PVC)", tempPodName) - - // Force immediate termination with zero grace period - gracePeriod := int64(0) - deleteOptions := v1.DeleteOptions{ - GracePeriodSeconds: &gracePeriod, - } - if err := config.K8sClient.CoreV1().Pods(sessionNamespace).Delete(context.TODO(), tempPodName, deleteOptions); err != nil && !errors.IsNotFound(err) { - log.Printf("[PVCConflict] Warning: failed to delete temp pod: %v", err) - } + // Create a Kubernetes Pod for this AgenticSession + podName := fmt.Sprintf("%s-runner", name) - // Wait for temp pod to fully terminate to prevent PVC mount conflicts - // This is critical because ReadWriteOnce PVCs cannot be mounted by multiple pods - // With gracePeriod=0, this should complete in 1-3 seconds - log.Printf("[PVCConflict] Waiting for temp pod %s to fully terminate...", tempPodName) - maxWaitSeconds := 10 // Reduced from 30 since we're force-deleting - for i := 0; i < maxWaitSeconds*4; i++ { // Poll 4x per second for faster detection - _, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), tempPodName, v1.GetOptions{}) - if errors.IsNotFound(err) { - elapsed := float64(i) * 0.25 - log.Printf("[PVCConflict] Temp pod fully terminated after %.2f seconds", elapsed) - break - } - if i == (maxWaitSeconds*4)-1 { - log.Printf("[PVCConflict] Warning: temp pod still exists after %d seconds, proceeding anyway", maxWaitSeconds) + // Ensure runner token exists before creating pod + // This handles cases where sessions are created directly via kubectl (bypassing the backend) + // or when 
the backend failed to provision the token + runnerTokenSecretName := fmt.Sprintf("ambient-runner-token-%s", name) + if _, err := config.K8sClient.CoreV1().Secrets(sessionNamespace).Get(context.TODO(), runnerTokenSecretName, v1.GetOptions{}); err != nil { + if errors.IsNotFound(err) { + log.Printf("Runner token secret %s not found, creating it now", runnerTokenSecretName) + if err := regenerateRunnerToken(sessionNamespace, name, currentObj); err != nil { + errMsg := fmt.Sprintf("Failed to provision runner token: %v", err) + log.Print(errMsg) + statusPatch.SetField("phase", "Failed") + statusPatch.AddCondition(conditionUpdate{ + Type: conditionReady, + Status: "False", + Reason: "TokenProvisionFailed", + Message: errMsg, + }) + _ = statusPatch.Apply() + return fmt.Errorf("failed to provision runner token for session %s: %v", name, err) } - time.Sleep(250 * time.Millisecond) // Poll every 250ms instead of 1s + log.Printf("Successfully provisioned runner token for session %s", name) + } else { + log.Printf("Warning: error checking runner token secret: %v", err) } - - // Clear temp pod annotations since we're starting the session - _ = clearAnnotation(sessionNamespace, name, tempContentRequestedAnnotation) - _ = clearAnnotation(sessionNamespace, name, tempContentLastAccessedAnnotation) } - // Create a Kubernetes Job for this AgenticSession - jobName := fmt.Sprintf("%s-job", name) - - // Check if job already exists in the session's namespace - _, err = config.K8sClient.BatchV1().Jobs(sessionNamespace).Get(context.TODO(), jobName, v1.GetOptions{}) + // Check if pod already exists in the session's namespace + _, err = config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), podName, v1.GetOptions{}) if err == nil { - log.Printf("Job %s already exists for AgenticSession %s", jobName, name) + log.Printf("Pod %s already exists for AgenticSession %s", podName, name) statusPatch.SetField("phase", "Creating") statusPatch.SetField("observedGeneration", currentObj.GetGeneration()) statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "True", - Reason: "JobExists", - Message: "Runner job already exists", + Reason: "PodExists", + Message: "Runner pod already exists", }) _ = statusPatch.Apply() - // Clear desired-phase annotation if it exists (job already created) + // Clear desired-phase annotation if it exists (pod already created) _ = clearAnnotation(sessionNamespace, name, "ambient-code.io/desired-phase") return nil } @@ -927,7 +747,7 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { var repos []RepoConfig - // Read simplified repos[] array format + // Read repos[] array format if reposArr, found, _ := unstructured.NestedSlice(spec, "repos"); found && len(reposArr) > 0 { repos = make([]RepoConfig, 0, len(reposArr)) for _, repoItem := range reposArr { @@ -946,34 +766,6 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { } } } - } else { - // Fallback to old format for backward compatibility (input/output structure) - inputRepo, _, _ := unstructured.NestedString(spec, "inputRepo") - inputBranch, _, _ := unstructured.NestedString(spec, "inputBranch") - if v, found, _ := unstructured.NestedString(spec, "input", "repo"); found && strings.TrimSpace(v) != "" { - inputRepo = v - } - if v, found, _ := unstructured.NestedString(spec, "input", "branch"); found && strings.TrimSpace(v) != "" { - inputBranch = v - } - if inputRepo != "" { - if inputBranch == "" { - inputBranch = "main" - } - repos = 
[]RepoConfig{{ - URL: inputRepo, - Branch: inputBranch, - }} - } - } - - // Get first repo for backward compatibility env vars (first repo is always main repo) - var inputRepo, inputBranch, outputRepo, outputBranch string - if len(repos) > 0 { - inputRepo = repos[0].URL - inputBranch = repos[0].Branch - outputRepo = repos[0].URL // Output same as input in simplified format - outputBranch = repos[0].Branch } // Read autoPushOnComplete flag @@ -992,18 +784,45 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { } log.Printf("Session %s initiated by user: %s (userId: %s)", name, userName, userID) - // Create the Job - job := &batchv1.Job{ + // Get S3 configuration for this project (from project secret or operator defaults) + s3Endpoint, s3Bucket, s3AccessKey, s3SecretKey, err := getS3ConfigForProject(sessionNamespace, appConfig) + if err != nil { + log.Printf("Warning: S3 not available for project %s: %v (sessions will use ephemeral storage only)", sessionNamespace, err) + statusPatch.AddCondition(conditionUpdate{ + Type: "S3Available", + Status: "False", + Reason: "NotConfigured", + Message: fmt.Sprintf("S3 storage not configured: %v. Session state will not persist across pod restarts. Configure S3 in project settings.", err), + }) + // Set empty values - init-hydrate and state-sync will skip S3 operations + s3Endpoint = "" + s3Bucket = "" + s3AccessKey = "" + s3SecretKey = "" + } else { + log.Printf("S3 configured for project %s: endpoint=%s, bucket=%s", sessionNamespace, s3Endpoint, s3Bucket) + statusPatch.AddCondition(conditionUpdate{ + Type: "S3Available", + Status: "True", + Reason: "Configured", + Message: fmt.Sprintf("S3 storage configured: %s/%s", s3Endpoint, s3Bucket), + }) + } + + // Create the Pod directly (no Job wrapper for faster startup) + pod := &corev1.Pod{ ObjectMeta: v1.ObjectMeta{ - Name: jobName, + Name: podName, Namespace: sessionNamespace, Labels: map[string]string{ "agentic-session": name, "app": "ambient-code-runner", }, + // If you run a service mesh that injects sidecars and causes egress issues: + // Annotations: map[string]string{"sidecar.istio.io/inject": "false"}, OwnerReferences: []v1.OwnerReference{ { - APIVersion: "vteam.ambient-code/v1", + APIVersion: "vteam.ambient-code/v1alpha1", Kind: "AgenticSession", Name: currentObj.GetName(), UID: currentObj.GetUID(), @@ -1013,339 +832,418 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }, }, }, - Spec: batchv1.JobSpec{ - BackoffLimit: int32Ptr(3), - ActiveDeadlineSeconds: int64Ptr(14400), // 4 hour timeout for safety - // Auto-cleanup finished Jobs if TTL controller is enabled in the cluster - TTLSecondsAfterFinished: int32Ptr(600), - Template: corev1.PodTemplateSpec{ - ObjectMeta: v1.ObjectMeta{ - Labels: map[string]string{ - "agentic-session": name, - "app": "ambient-code-runner", + Spec: corev1.PodSpec{ + RestartPolicy: corev1.RestartPolicyNever, + TerminationGracePeriodSeconds: int64Ptr(30), // Allow time for state-sync final sync + // Explicitly set service account for pod creation permissions + AutomountServiceAccountToken: boolPtr(false), + Volumes: []corev1.Volume{ + { + Name: "workspace", + VolumeSource: corev1.VolumeSource{ + EmptyDir: &corev1.EmptyDirVolumeSource{ + SizeLimit: resource.NewQuantity(10*1024*1024*1024, resource.BinarySI), // 10Gi + }, }, - // If you run a service mesh that injects sidecars and causes egress issues for Jobs: - // Annotations: map[string]string{"sidecar.istio.io/inject": "false"}, }, - Spec: corev1.PodSpec{ - RestartPolicy: 
corev1.RestartPolicyNever, - // Explicitly set service account for pod creation permissions - AutomountServiceAccountToken: boolPtr(false), - Volumes: []corev1.Volume{ - { - Name: "workspace", - VolumeSource: corev1.VolumeSource{ - PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ - ClaimName: pvcName, - }, - }, + }, + + // InitContainer to hydrate session state from S3 + InitContainers: []corev1.Container{ + { + Name: "init-hydrate", + Image: appConfig.StateSyncImage, + ImagePullPolicy: appConfig.ImagePullPolicy, + Command: []string{"/usr/local/bin/hydrate.sh"}, + SecurityContext: &corev1.SecurityContext{ + AllowPrivilegeEscalation: boolPtr(false), + ReadOnlyRootFilesystem: boolPtr(false), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, }, }, + Env: func() []corev1.EnvVar { + base := []corev1.EnvVar{ + {Name: "SESSION_NAME", Value: name}, + {Name: "NAMESPACE", Value: sessionNamespace}, + {Name: "S3_ENDPOINT", Value: s3Endpoint}, + {Name: "S3_BUCKET", Value: s3Bucket}, + {Name: "AWS_ACCESS_KEY_ID", Value: s3AccessKey}, + {Name: "AWS_SECRET_ACCESS_KEY", Value: s3SecretKey}, + {Name: "GIT_USER_NAME", Value: os.Getenv("GIT_USER_NAME")}, + {Name: "GIT_USER_EMAIL", Value: os.Getenv("GIT_USER_EMAIL")}, + } + + // Add repos JSON if present + if repos, ok := spec["repos"].([]interface{}); ok && len(repos) > 0 { + b, _ := json.Marshal(repos) + base = append(base, corev1.EnvVar{Name: "REPOS_JSON", Value: string(b)}) + } + + // Add workflow info if present + if workflow, ok := spec["activeWorkflow"].(map[string]interface{}); ok { + if gitURL, ok := workflow["gitUrl"].(string); ok && strings.TrimSpace(gitURL) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_GIT_URL", Value: gitURL}) + } + if branch, ok := workflow["branch"].(string); ok && strings.TrimSpace(branch) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_BRANCH", Value: branch}) + } + if path, ok := workflow["path"].(string); ok && strings.TrimSpace(path) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_PATH", Value: path}) + } + } + + // Add GitHub token for private repos + secretName := "" + if meta, ok := currentObj.Object["metadata"].(map[string]interface{}); ok { + if anns, ok := meta["annotations"].(map[string]interface{}); ok { + if v, ok := anns["ambient-code.io/runner-token-secret"].(string); ok && strings.TrimSpace(v) != "" { + secretName = strings.TrimSpace(v) + } + } + } + if secretName == "" { + secretName = fmt.Sprintf("ambient-runner-token-%s", name) + } + base = append(base, corev1.EnvVar{ + Name: "BOT_TOKEN", + ValueFrom: &corev1.EnvVarSource{SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: secretName}, + Key: "k8s-token", + }}, + }) - // InitContainer to ensure workspace directory structure exists - InitContainers: []corev1.Container{ - { - Name: "init-workspace", - Image: "registry.access.redhat.com/ubi8/ubi-minimal:latest", - Command: []string{ - "sh", "-c", - fmt.Sprintf("mkdir -p /workspace/sessions/%s/workspace && chmod 777 /workspace/sessions/%s/workspace && echo 'Workspace initialized'", name, name), - }, - VolumeMounts: []corev1.VolumeMount{ - {Name: "workspace", MountPath: "/workspace"}, - }, - }, + return base + }(), + VolumeMounts: []corev1.VolumeMount{ + {Name: "workspace", MountPath: "/workspace"}, }, + }, + }, - // Flip roles so the content writer is the main container that keeps the pod alive - Containers: []corev1.Container{ - { - Name: "ambient-content", - Image: 
appConfig.ContentServiceImage, - ImagePullPolicy: appConfig.ImagePullPolicy, - Env: []corev1.EnvVar{ - {Name: "CONTENT_SERVICE_MODE", Value: "true"}, - {Name: "STATE_BASE_DIR", Value: "/workspace"}, - }, - Ports: []corev1.ContainerPort{{ContainerPort: 8080, Name: "http"}}, - ReadinessProbe: &corev1.Probe{ - ProbeHandler: corev1.ProbeHandler{ - HTTPGet: &corev1.HTTPGetAction{ - Path: "/health", - Port: intstr.FromString("http"), - }, - }, - InitialDelaySeconds: 5, - PeriodSeconds: 5, + // Flip roles so the content writer is the main container that keeps the pod alive + Containers: []corev1.Container{ + { + Name: "ambient-content", + Image: appConfig.ContentServiceImage, + ImagePullPolicy: appConfig.ImagePullPolicy, + Env: []corev1.EnvVar{ + {Name: "CONTENT_SERVICE_MODE", Value: "true"}, + {Name: "STATE_BASE_DIR", Value: "/workspace"}, + }, + Ports: []corev1.ContainerPort{{ContainerPort: 8080, Name: "http"}}, + ReadinessProbe: &corev1.Probe{ + ProbeHandler: corev1.ProbeHandler{ + HTTPGet: &corev1.HTTPGetAction{ + Path: "/health", + Port: intstr.FromString("http"), }, - VolumeMounts: []corev1.VolumeMount{{Name: "workspace", MountPath: "/workspace"}}, }, - { - Name: "ambient-code-runner", - Image: appConfig.AmbientCodeRunnerImage, - ImagePullPolicy: appConfig.ImagePullPolicy, - // 🔒 Container-level security (SCC-compatible, no privileged capabilities) - SecurityContext: &corev1.SecurityContext{ - AllowPrivilegeEscalation: boolPtr(false), - ReadOnlyRootFilesystem: boolPtr(false), // Playwright needs to write temp files - Capabilities: &corev1.Capabilities{ - Drop: []corev1.Capability{"ALL"}, // Drop all capabilities for security - }, - }, - - // Expose AG-UI server port for backend proxy - Ports: []corev1.ContainerPort{{ - Name: "agui", - ContainerPort: 8001, - Protocol: corev1.ProtocolTCP, - }}, - - VolumeMounts: []corev1.VolumeMount{ - {Name: "workspace", MountPath: "/workspace", ReadOnly: false}, - // Mount .claude directory for session state persistence - // This enables SDK's built-in resume functionality - {Name: "workspace", MountPath: "/app/.claude", SubPath: fmt.Sprintf("sessions/%s/.claude", name), ReadOnly: false}, - }, - - Env: func() []corev1.EnvVar { - base := []corev1.EnvVar{ - {Name: "DEBUG", Value: "true"}, - {Name: "INTERACTIVE", Value: fmt.Sprintf("%t", interactive)}, - {Name: "AGENTIC_SESSION_NAME", Value: name}, - {Name: "AGENTIC_SESSION_NAMESPACE", Value: sessionNamespace}, - // Provide session id and workspace path for the runner wrapper - {Name: "SESSION_ID", Value: name}, - {Name: "WORKSPACE_PATH", Value: fmt.Sprintf("/workspace/sessions/%s/workspace", name)}, - {Name: "ARTIFACTS_DIR", Value: "_artifacts"}, - // Google MCP credentials directory for workspace-mcp server (writable workspace location) - {Name: "GOOGLE_MCP_CREDENTIALS_DIR", Value: "/workspace/.google_workspace_mcp/credentials"}, - // Google OAuth client credentials for workspace-mcp - {Name: "GOOGLE_OAUTH_CLIENT_ID", Value: os.Getenv("GOOGLE_OAUTH_CLIENT_ID")}, - {Name: "GOOGLE_OAUTH_CLIENT_SECRET", Value: os.Getenv("GOOGLE_OAUTH_CLIENT_SECRET")}, - } + InitialDelaySeconds: 5, + PeriodSeconds: 5, + }, + VolumeMounts: []corev1.VolumeMount{{Name: "workspace", MountPath: "/workspace"}}, + }, + { + Name: "ambient-code-runner", + Image: appConfig.AmbientCodeRunnerImage, + ImagePullPolicy: appConfig.ImagePullPolicy, + // 🔒 Container-level security (SCC-compatible, no privileged capabilities) + SecurityContext: &corev1.SecurityContext{ + AllowPrivilegeEscalation: boolPtr(false), + ReadOnlyRootFilesystem: 
boolPtr(false), // Playwright needs to write temp files + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, // Drop all capabilities for security + }, + }, - // Add user context for observability and auditing (Langfuse userId, logs, etc.) - if userID != "" { - base = append(base, corev1.EnvVar{Name: "USER_ID", Value: userID}) - } - if userName != "" { - base = append(base, corev1.EnvVar{Name: "USER_NAME", Value: userName}) - } + // Expose AG-UI server port for backend proxy + Ports: []corev1.ContainerPort{{ + Name: "agui", + ContainerPort: 8001, + Protocol: corev1.ProtocolTCP, + }}, - // Add per-repo environment variables (simplified format) - for i, repo := range repos { - base = append(base, - corev1.EnvVar{Name: fmt.Sprintf("REPO_%d_URL", i), Value: repo.URL}, - corev1.EnvVar{Name: fmt.Sprintf("REPO_%d_BRANCH", i), Value: repo.Branch}, - ) - } + VolumeMounts: []corev1.VolumeMount{ + {Name: "workspace", MountPath: "/workspace", ReadOnly: false}, + // Mount .claude directory for session state persistence (synced to S3) + // This enables SDK's built-in resume functionality + {Name: "workspace", MountPath: "/app/.claude", SubPath: ".claude", ReadOnly: false}, + }, - // Backward compatibility: set INPUT_REPO_URL/OUTPUT_REPO_URL from main repo - base = append(base, - corev1.EnvVar{Name: "INPUT_REPO_URL", Value: inputRepo}, - corev1.EnvVar{Name: "INPUT_BRANCH", Value: inputBranch}, - corev1.EnvVar{Name: "OUTPUT_REPO_URL", Value: outputRepo}, - corev1.EnvVar{Name: "OUTPUT_BRANCH", Value: outputBranch}, - corev1.EnvVar{Name: "INITIAL_PROMPT", Value: prompt}, - corev1.EnvVar{Name: "LLM_MODEL", Value: model}, - corev1.EnvVar{Name: "LLM_TEMPERATURE", Value: fmt.Sprintf("%.2f", temperature)}, - corev1.EnvVar{Name: "LLM_MAX_TOKENS", Value: fmt.Sprintf("%d", maxTokens)}, - corev1.EnvVar{Name: "USE_AGUI", Value: "true"}, - corev1.EnvVar{Name: "TIMEOUT", Value: fmt.Sprintf("%d", timeout)}, - corev1.EnvVar{Name: "AUTO_PUSH_ON_COMPLETE", Value: fmt.Sprintf("%t", autoPushOnComplete)}, - corev1.EnvVar{Name: "BACKEND_API_URL", Value: fmt.Sprintf("http://backend-service.%s.svc.cluster.local:8080/api", appConfig.BackendNamespace)}, - // LEGACY: WEBSOCKET_URL removed - runner now uses AG-UI server pattern (FastAPI) - // Backend proxies to runner's HTTP endpoint instead of WebSocket - ) - - // Platform-wide Langfuse observability configuration - // Uses secretKeyRef to prevent credential exposure in pod specs - // Secret is copied to session namespace from operator namespace - // All keys are optional to prevent pod startup failures if keys are missing - if ambientLangfuseSecretCopied { - base = append(base, - corev1.EnvVar{ - Name: "LANGFUSE_ENABLED", - ValueFrom: &corev1.EnvVarSource{ - SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, - Key: "LANGFUSE_ENABLED", - Optional: boolPtr(true), - }, - }, + Env: func() []corev1.EnvVar { + base := []corev1.EnvVar{ + {Name: "DEBUG", Value: "true"}, + {Name: "INTERACTIVE", Value: fmt.Sprintf("%t", interactive)}, + {Name: "AGENTIC_SESSION_NAME", Value: name}, + {Name: "AGENTIC_SESSION_NAMESPACE", Value: sessionNamespace}, + // Provide session id and workspace path for the runner wrapper + {Name: "SESSION_ID", Value: name}, + {Name: "WORKSPACE_PATH", Value: "/workspace"}, + {Name: "ARTIFACTS_DIR", Value: "artifacts"}, + // Google MCP credentials directory for workspace-mcp server (writable workspace location) + {Name: "GOOGLE_MCP_CREDENTIALS_DIR", Value: 
"/workspace/.google_workspace_mcp/credentials"}, + // Google OAuth client credentials for workspace-mcp + {Name: "GOOGLE_OAUTH_CLIENT_ID", Value: os.Getenv("GOOGLE_OAUTH_CLIENT_ID")}, + {Name: "GOOGLE_OAUTH_CLIENT_SECRET", Value: os.Getenv("GOOGLE_OAUTH_CLIENT_SECRET")}, + } + + // Add user context for observability and auditing (Langfuse userId, logs, etc.) + if userID != "" { + base = append(base, corev1.EnvVar{Name: "USER_ID", Value: userID}) + } + if userName != "" { + base = append(base, corev1.EnvVar{Name: "USER_NAME", Value: userName}) + } + + // Core session env vars + base = append(base, + corev1.EnvVar{Name: "INITIAL_PROMPT", Value: prompt}, + corev1.EnvVar{Name: "LLM_MODEL", Value: model}, + corev1.EnvVar{Name: "LLM_TEMPERATURE", Value: fmt.Sprintf("%.2f", temperature)}, + corev1.EnvVar{Name: "LLM_MAX_TOKENS", Value: fmt.Sprintf("%d", maxTokens)}, + corev1.EnvVar{Name: "USE_AGUI", Value: "true"}, + corev1.EnvVar{Name: "TIMEOUT", Value: fmt.Sprintf("%d", timeout)}, + corev1.EnvVar{Name: "AUTO_PUSH_ON_COMPLETE", Value: fmt.Sprintf("%t", autoPushOnComplete)}, + corev1.EnvVar{Name: "BACKEND_API_URL", Value: fmt.Sprintf("http://backend-service.%s.svc.cluster.local:8080/api", appConfig.BackendNamespace)}, + // LEGACY: WEBSOCKET_URL removed - runner now uses AG-UI server pattern (FastAPI) + // Backend proxies to runner's HTTP endpoint instead of WebSocket + ) + + // Platform-wide Langfuse observability configuration + // Uses secretKeyRef to prevent credential exposure in pod specs + // Secret is copied to session namespace from operator namespace + // All keys are optional to prevent pod startup failures if keys are missing + if ambientLangfuseSecretCopied { + base = append(base, + corev1.EnvVar{ + Name: "LANGFUSE_ENABLED", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, + Key: "LANGFUSE_ENABLED", + Optional: boolPtr(true), }, - corev1.EnvVar{ - Name: "LANGFUSE_HOST", - ValueFrom: &corev1.EnvVarSource{ - SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, - Key: "LANGFUSE_HOST", - Optional: boolPtr(true), - }, - }, + }, + }, + corev1.EnvVar{ + Name: "LANGFUSE_HOST", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, + Key: "LANGFUSE_HOST", + Optional: boolPtr(true), }, - corev1.EnvVar{ - Name: "LANGFUSE_PUBLIC_KEY", - ValueFrom: &corev1.EnvVarSource{ - SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, - Key: "LANGFUSE_PUBLIC_KEY", - Optional: boolPtr(true), - }, - }, + }, + }, + corev1.EnvVar{ + Name: "LANGFUSE_PUBLIC_KEY", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, + Key: "LANGFUSE_PUBLIC_KEY", + Optional: boolPtr(true), }, - corev1.EnvVar{ - Name: "LANGFUSE_SECRET_KEY", - ValueFrom: &corev1.EnvVarSource{ - SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, - Key: "LANGFUSE_SECRET_KEY", - Optional: boolPtr(true), - }, - }, + }, + }, + corev1.EnvVar{ + Name: "LANGFUSE_SECRET_KEY", + ValueFrom: &corev1.EnvVarSource{ + SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: 
corev1.LocalObjectReference{Name: "ambient-admin-langfuse-secret"}, + Key: "LANGFUSE_SECRET_KEY", + Optional: boolPtr(true), }, - ) - log.Printf("Langfuse env vars configured via secretKeyRef for session %s", name) + }, + }, + ) + log.Printf("Langfuse env vars configured via secretKeyRef for session %s", name) + } + + // Add Vertex AI configuration only if enabled + if vertexEnabled { + base = append(base, + corev1.EnvVar{Name: "CLAUDE_CODE_USE_VERTEX", Value: "1"}, + corev1.EnvVar{Name: "CLOUD_ML_REGION", Value: os.Getenv("CLOUD_ML_REGION")}, + corev1.EnvVar{Name: "ANTHROPIC_VERTEX_PROJECT_ID", Value: os.Getenv("ANTHROPIC_VERTEX_PROJECT_ID")}, + corev1.EnvVar{Name: "GOOGLE_APPLICATION_CREDENTIALS", Value: os.Getenv("GOOGLE_APPLICATION_CREDENTIALS")}, + ) + } else { + // Explicitly set to 0 when Vertex is disabled + base = append(base, corev1.EnvVar{Name: "CLAUDE_CODE_USE_VERTEX", Value: "0"}) + } + + // Add PARENT_SESSION_ID if this is a continuation + if parentSessionID != "" { + base = append(base, corev1.EnvVar{Name: "PARENT_SESSION_ID", Value: parentSessionID}) + log.Printf("Session %s: passing PARENT_SESSION_ID=%s to runner", name, parentSessionID) + } + + // Add IS_RESUME if this session has been started before + // Check status.startTime - if present, this is a resume (pod recreate/restart) + // This tells the runner to skip INITIAL_PROMPT and use continue_conversation + if status, found, _ := unstructured.NestedMap(currentObj.Object, "status"); found { + if startTime, ok := status["startTime"].(string); ok && startTime != "" { + base = append(base, corev1.EnvVar{Name: "IS_RESUME", Value: "true"}) + log.Printf("Session %s: marking as resume (IS_RESUME=true, startTime=%s)", name, startTime) + } + } + + // If backend annotated the session with a runner token secret, inject only BOT_TOKEN + // Secret contains: 'k8s-token' (for CR updates) + // Prefer annotated secret name; fallback to deterministic name + secretName := "" + if meta, ok := currentObj.Object["metadata"].(map[string]interface{}); ok { + if anns, ok := meta["annotations"].(map[string]interface{}); ok { + if v, ok := anns["ambient-code.io/runner-token-secret"].(string); ok && strings.TrimSpace(v) != "" { + secretName = strings.TrimSpace(v) } - - // Add Vertex AI configuration only if enabled - if vertexEnabled { - base = append(base, - corev1.EnvVar{Name: "CLAUDE_CODE_USE_VERTEX", Value: "1"}, - corev1.EnvVar{Name: "CLOUD_ML_REGION", Value: os.Getenv("CLOUD_ML_REGION")}, - corev1.EnvVar{Name: "ANTHROPIC_VERTEX_PROJECT_ID", Value: os.Getenv("ANTHROPIC_VERTEX_PROJECT_ID")}, - corev1.EnvVar{Name: "GOOGLE_APPLICATION_CREDENTIALS", Value: os.Getenv("GOOGLE_APPLICATION_CREDENTIALS")}, - ) - } else { - // Explicitly set to 0 when Vertex is disabled - base = append(base, corev1.EnvVar{Name: "CLAUDE_CODE_USE_VERTEX", Value: "0"}) + } + } + if secretName == "" { + secretName = fmt.Sprintf("ambient-runner-token-%s", name) + } + base = append(base, corev1.EnvVar{ + Name: "BOT_TOKEN", + ValueFrom: &corev1.EnvVarSource{SecretKeyRef: &corev1.SecretKeySelector{ + LocalObjectReference: corev1.LocalObjectReference{Name: secretName}, + Key: "k8s-token", + }}, + }) + // Add CR-provided envs last (override base when same key) + if spec, ok := currentObj.Object["spec"].(map[string]interface{}); ok { + // Inject REPOS_JSON and MAIN_REPO_NAME from spec.repos and spec.mainRepoName if present + if repos, ok := spec["repos"].([]interface{}); ok && len(repos) > 0 { + // Use a minimal JSON serialization via fmt (we'll rely on client to pass REPOS_JSON 
too) + // This ensures runner gets repos even if env vars weren't passed from frontend + b, _ := json.Marshal(repos) + base = append(base, corev1.EnvVar{Name: "REPOS_JSON", Value: string(b)}) + } + if mrn, ok := spec["mainRepoName"].(string); ok && strings.TrimSpace(mrn) != "" { + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_NAME", Value: mrn}) + } + // Inject MAIN_REPO_INDEX if provided + if mriRaw, ok := spec["mainRepoIndex"]; ok { + switch v := mriRaw.(type) { + case int64: + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) + case int32: + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) + case int: + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) + case float64: + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", int64(v))}) + case string: + if strings.TrimSpace(v) != "" { + base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: v}) + } } - - // Add PARENT_SESSION_ID if this is a continuation - if parentSessionID != "" { - base = append(base, corev1.EnvVar{Name: "PARENT_SESSION_ID", Value: parentSessionID}) - log.Printf("Session %s: passing PARENT_SESSION_ID=%s to runner", name, parentSessionID) + } + // Inject activeWorkflow environment variables if present + if workflow, ok := spec["activeWorkflow"].(map[string]interface{}); ok { + if gitURL, ok := workflow["gitUrl"].(string); ok && strings.TrimSpace(gitURL) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_GIT_URL", Value: gitURL}) } - // If backend annotated the session with a runner token secret, inject only BOT_TOKEN - // Secret contains: 'k8s-token' (for CR updates) - // Prefer annotated secret name; fallback to deterministic name - secretName := "" - if meta, ok := currentObj.Object["metadata"].(map[string]interface{}); ok { - if anns, ok := meta["annotations"].(map[string]interface{}); ok { - if v, ok := anns["ambient-code.io/runner-token-secret"].(string); ok && strings.TrimSpace(v) != "" { - secretName = strings.TrimSpace(v) - } - } + if branch, ok := workflow["branch"].(string); ok && strings.TrimSpace(branch) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_BRANCH", Value: branch}) } - if secretName == "" { - secretName = fmt.Sprintf("ambient-runner-token-%s", name) + if path, ok := workflow["path"].(string); ok && strings.TrimSpace(path) != "" { + base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_PATH", Value: path}) } - base = append(base, corev1.EnvVar{ - Name: "BOT_TOKEN", - ValueFrom: &corev1.EnvVarSource{SecretKeyRef: &corev1.SecretKeySelector{ - LocalObjectReference: corev1.LocalObjectReference{Name: secretName}, - Key: "k8s-token", - }}, - }) - // Add CR-provided envs last (override base when same key) - if spec, ok := currentObj.Object["spec"].(map[string]interface{}); ok { - // Inject REPOS_JSON and MAIN_REPO_NAME from spec.repos and spec.mainRepoName if present - if repos, ok := spec["repos"].([]interface{}); ok && len(repos) > 0 { - // Use a minimal JSON serialization via fmt (we'll rely on client to pass REPOS_JSON too) - // This ensures runner gets repos even if env vars weren't passed from frontend - b, _ := json.Marshal(repos) - base = append(base, corev1.EnvVar{Name: "REPOS_JSON", Value: string(b)}) - } - if mrn, ok := spec["mainRepoName"].(string); ok && strings.TrimSpace(mrn) != "" { - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_NAME", Value: mrn}) - } - // Inject MAIN_REPO_INDEX if 
provided - if mriRaw, ok := spec["mainRepoIndex"]; ok { - switch v := mriRaw.(type) { - case int64: - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) - case int32: - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) - case int: - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", v)}) - case float64: - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: fmt.Sprintf("%d", int64(v))}) - case string: - if strings.TrimSpace(v) != "" { - base = append(base, corev1.EnvVar{Name: "MAIN_REPO_INDEX", Value: v}) + } + if envMap, ok := spec["environmentVariables"].(map[string]interface{}); ok { + for k, v := range envMap { + if vs, ok := v.(string); ok { + // replace if exists + replaced := false + for i := range base { + if base[i].Name == k { + base[i].Value = vs + replaced = true + break } } - } - // Inject activeWorkflow environment variables if present - if workflow, ok := spec["activeWorkflow"].(map[string]interface{}); ok { - if gitURL, ok := workflow["gitUrl"].(string); ok && strings.TrimSpace(gitURL) != "" { - base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_GIT_URL", Value: gitURL}) - } - if branch, ok := workflow["branch"].(string); ok && strings.TrimSpace(branch) != "" { - base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_BRANCH", Value: branch}) - } - if path, ok := workflow["path"].(string); ok && strings.TrimSpace(path) != "" { - base = append(base, corev1.EnvVar{Name: "ACTIVE_WORKFLOW_PATH", Value: path}) + if !replaced { + base = append(base, corev1.EnvVar{Name: k, Value: vs}) } } - if envMap, ok := spec["environmentVariables"].(map[string]interface{}); ok { - for k, v := range envMap { - if vs, ok := v.(string); ok { - // replace if exists - replaced := false - for i := range base { - if base[i].Name == k { - base[i].Value = vs - replaced = true - break - } - } - if !replaced { - base = append(base, corev1.EnvVar{Name: k, Value: vs}) - } - } - } - } - } - - return base - }(), - - // Import secrets as environment variables - // - integrationSecretsName: Only if exists (GIT_TOKEN, JIRA_*, custom keys) - // - runnerSecretsName: Only when Vertex disabled (ANTHROPIC_API_KEY) - // - ambient-langfuse-keys: Platform-wide Langfuse observability (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, LANGFUSE_ENABLED) - EnvFrom: func() []corev1.EnvFromSource { - sources := []corev1.EnvFromSource{} - - // Only inject integration secrets if they exist (optional) - if integrationSecretsExist { - sources = append(sources, corev1.EnvFromSource{ - SecretRef: &corev1.SecretEnvSource{ - LocalObjectReference: corev1.LocalObjectReference{Name: integrationSecretsName}, - }, - }) - log.Printf("Injecting integration secrets from '%s' for session %s", integrationSecretsName, name) - } else { - log.Printf("Skipping integration secrets '%s' for session %s (not found or not configured)", integrationSecretsName, name) - } - - // Only inject runner secrets (ANTHROPIC_API_KEY) when Vertex is disabled - if !vertexEnabled && runnerSecretsName != "" { - sources = append(sources, corev1.EnvFromSource{ - SecretRef: &corev1.SecretEnvSource{ - LocalObjectReference: corev1.LocalObjectReference{Name: runnerSecretsName}, - }, - }) - log.Printf("Injecting runner secrets from '%s' for session %s (Vertex disabled)", runnerSecretsName, name) - } else if vertexEnabled && runnerSecretsName != "" { - log.Printf("Skipping runner secrets '%s' for session %s (Vertex enabled)", 
runnerSecretsName, name) } + } + } + + return base + }(), + + // Import secrets as environment variables + // - integrationSecretsName: Only if exists (GIT_TOKEN, JIRA_*, custom keys) + // - runnerSecretsName: Only when Vertex disabled (ANTHROPIC_API_KEY) + // - ambient-langfuse-keys: Platform-wide Langfuse observability (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST, LANGFUSE_ENABLED) + EnvFrom: func() []corev1.EnvFromSource { + sources := []corev1.EnvFromSource{} + + // Only inject integration secrets if they exist (optional) + if integrationSecretsExist { + sources = append(sources, corev1.EnvFromSource{ + SecretRef: &corev1.SecretEnvSource{ + LocalObjectReference: corev1.LocalObjectReference{Name: integrationSecretsName}, + }, + }) + log.Printf("Injecting integration secrets from '%s' for session %s", integrationSecretsName, name) + } else { + log.Printf("Skipping integration secrets '%s' for session %s (not found or not configured)", integrationSecretsName, name) + } + + // Only inject runner secrets (ANTHROPIC_API_KEY) when Vertex is disabled + if !vertexEnabled && runnerSecretsName != "" { + sources = append(sources, corev1.EnvFromSource{ + SecretRef: &corev1.SecretEnvSource{ + LocalObjectReference: corev1.LocalObjectReference{Name: runnerSecretsName}, + }, + }) + log.Printf("Injecting runner secrets from '%s' for session %s (Vertex disabled)", runnerSecretsName, name) + } else if vertexEnabled && runnerSecretsName != "" { + log.Printf("Skipping runner secrets '%s' for session %s (Vertex enabled)", runnerSecretsName, name) + } - return sources - }(), + return sources + }(), - Resources: corev1.ResourceRequirements{}, + Resources: corev1.ResourceRequirements{}, + }, + // S3 state-sync sidecar - syncs .claude/, artifacts/, uploads/ to S3 + { + Name: "state-sync", + Image: appConfig.StateSyncImage, + ImagePullPolicy: appConfig.ImagePullPolicy, + Command: []string{"/usr/local/bin/sync.sh"}, + SecurityContext: &corev1.SecurityContext{ + AllowPrivilegeEscalation: boolPtr(false), + ReadOnlyRootFilesystem: boolPtr(false), + Capabilities: &corev1.Capabilities{ + Drop: []corev1.Capability{"ALL"}, + }, + }, + Env: []corev1.EnvVar{ + {Name: "SESSION_NAME", Value: name}, + {Name: "NAMESPACE", Value: sessionNamespace}, + {Name: "S3_ENDPOINT", Value: s3Endpoint}, + {Name: "S3_BUCKET", Value: s3Bucket}, + {Name: "SYNC_INTERVAL", Value: "60"}, + {Name: "MAX_SYNC_SIZE", Value: "1073741824"}, // 1GB + {Name: "AWS_ACCESS_KEY_ID", Value: s3AccessKey}, + {Name: "AWS_SECRET_ACCESS_KEY", Value: s3SecretKey}, + }, + VolumeMounts: []corev1.VolumeMount{ + {Name: "workspace", MountPath: "/workspace", ReadOnly: false}, + }, + Resources: corev1.ResourceRequirements{ + Requests: corev1.ResourceList{ + corev1.ResourceCPU: resource.MustParse("50m"), + corev1.ResourceMemory: resource.MustParse("64Mi"), + }, + Limits: corev1.ResourceList{ + corev1.ResourceCPU: resource.MustParse("200m"), + corev1.ResourceMemory: resource.MustParse("256Mi"), }, }, }, @@ -1358,14 +1256,14 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { // If ambient-vertex secret was successfully copied, mount it as a volume if ambientVertexSecretCopied { - job.Spec.Template.Spec.Volumes = append(job.Spec.Template.Spec.Volumes, corev1.Volume{ + pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{ Name: "vertex", VolumeSource: corev1.VolumeSource{Secret: &corev1.SecretVolumeSource{SecretName: types.AmbientVertexSecretName}}, }) // Mount to the ambient-code-runner container by name - for i := range 
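// A minimal sketch of the loop the state-sync sidecar above presumably runs, based only on
// the contract implied by its env vars (SESSION_NAME, S3_ENDPOINT, S3_BUCKET, SYNC_INTERVAL,
// MAX_SYNC_SIZE, AWS_* credentials) and its Command of /usr/local/bin/sync.sh. The actual
// script is shell and lives elsewhere in this patch; the AWS CLI invocation, the
// "sessions/<name>" key prefix, and the imports (fmt, log, os, os/exec, strconv, time) are
// assumptions for illustration only:
//
//	func runStateSync() {
//		intervalSec, _ := strconv.Atoi(os.Getenv("SYNC_INTERVAL"))
//		dst := fmt.Sprintf("s3://%s/sessions/%s", os.Getenv("S3_BUCKET"), os.Getenv("SESSION_NAME"))
//		for {
//			// Push session state (.claude/, artifacts/, uploads/) to S3; the AWS CLI reads
//			// AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the container environment.
//			cmd := exec.Command("aws", "--endpoint-url", os.Getenv("S3_ENDPOINT"),
//				"s3", "sync", "/workspace/.claude", dst+"/.claude")
//			if err := cmd.Run(); err != nil {
//				log.Printf("state-sync: sync failed: %v", err)
//			}
//			time.Sleep(time.Duration(intervalSec) * time.Second)
//		}
//	}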
job.Spec.Template.Spec.Containers { - if job.Spec.Template.Spec.Containers[i].Name == "ambient-code-runner" { - job.Spec.Template.Spec.Containers[i].VolumeMounts = append(job.Spec.Template.Spec.Containers[i].VolumeMounts, corev1.VolumeMount{ + for i := range pod.Spec.Containers { + if pod.Spec.Containers[i].Name == "ambient-code-runner" { + pod.Spec.Containers[i].VolumeMounts = append(pod.Spec.Containers[i].VolumeMounts, corev1.VolumeMount{ Name: "vertex", MountPath: "/app/vertex", ReadOnly: true, @@ -1393,7 +1291,7 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }, OwnerReferences: []v1.OwnerReference{ { - APIVersion: "vteam.ambient-code/v1", + APIVersion: "vteam.ambient-code/v1alpha1", Kind: "AgenticSession", Name: currentObj.GetName(), UID: currentObj.GetUID(), @@ -1419,7 +1317,7 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { // Always mount Google OAuth secret (with Optional: true so pod starts even if empty) // K8s will sync updates when backend populates credentials after OAuth completion (~60s) - job.Spec.Template.Spec.Volumes = append(job.Spec.Template.Spec.Volumes, corev1.Volume{ + pod.Spec.Volumes = append(pod.Spec.Volumes, corev1.Volume{ Name: "google-oauth", VolumeSource: corev1.VolumeSource{ Secret: &corev1.SecretVolumeSource{ @@ -1429,9 +1327,9 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }, }) // Mount to the ambient-code-runner container - for i := range job.Spec.Template.Spec.Containers { - if job.Spec.Template.Spec.Containers[i].Name == "ambient-code-runner" { - job.Spec.Template.Spec.Containers[i].VolumeMounts = append(job.Spec.Template.Spec.Containers[i].VolumeMounts, corev1.VolumeMount{ + for i := range pod.Spec.Containers { + if pod.Spec.Containers[i].Name == "ambient-code-runner" { + pod.Spec.Containers[i].VolumeMounts = append(pod.Spec.Containers[i].VolumeMounts, corev1.VolumeMount{ Name: "google-oauth", MountPath: "/app/.google_workspace_mcp/credentials", ReadOnly: true, @@ -1443,19 +1341,19 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { // Do not mount runner Secret volume; runner fetches tokens on demand - // Create the job - createdJob, err := config.K8sClient.BatchV1().Jobs(sessionNamespace).Create(context.TODO(), job, v1.CreateOptions{}) + // Create the pod + createdPod, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Create(context.TODO(), pod, v1.CreateOptions{}) if err != nil { - // If job already exists, this is likely a race condition from duplicate watch events - not an error + // If pod already exists, this is likely a race condition from duplicate watch events - not an error if errors.IsAlreadyExists(err) { - log.Printf("Job %s already exists (race condition), continuing", jobName) - // Clear desired-phase annotation since job exists + log.Printf("Pod %s already exists (race condition), continuing", podName) + // Clear desired-phase annotation since pod exists _ = clearAnnotation(sessionNamespace, name, "ambient-code.io/desired-phase") return nil } - log.Printf("Failed to create job %s: %v", jobName, err) + log.Printf("Failed to create pod %s: %v", podName, err) statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "False", Reason: "CreateFailed", Message: err.Error(), @@ -1463,54 +1361,54 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { statusPatch.AddCondition(conditionUpdate{ Type: conditionReady, Status: "False", - Reason: "JobCreationFailed", - Message: 
"Runner job creation failed", + Reason: "PodCreationFailed", + Message: "Runner pod creation failed", }) _ = statusPatch.Apply() - return fmt.Errorf("failed to create job: %v", err) + return fmt.Errorf("failed to create pod: %v", err) } - log.Printf("Created job %s for AgenticSession %s", jobName, name) + log.Printf("Created pod %s for AgenticSession %s", podName, name) statusPatch.SetField("phase", "Creating") statusPatch.SetField("observedGeneration", currentObj.GetGeneration()) statusPatch.AddCondition(conditionUpdate{ - Type: conditionJobCreated, + Type: conditionPodCreated, Status: "True", - Reason: "JobCreated", - Message: "Runner job created", + Reason: "PodCreated", + Message: "Runner pod created", }) // Apply all accumulated status changes in a single API call if err := statusPatch.Apply(); err != nil { log.Printf("Warning: failed to apply status patch: %v", err) } - // Clear desired-phase annotation now that job is created + // Clear desired-phase annotation now that pod is created // (This was deferred from the restart handler to avoid race conditions with stale events) _ = clearAnnotation(sessionNamespace, name, "ambient-code.io/desired-phase") - log.Printf("[DesiredPhase] Cleared desired-phase annotation after successful job creation") + log.Printf("[DesiredPhase] Cleared desired-phase annotation after successful pod creation") - // Create a per-job Service pointing to the content container + // Create a per-pod Service pointing to the content container svc := &corev1.Service{ ObjectMeta: v1.ObjectMeta{ Name: fmt.Sprintf("ambient-content-%s", name), Namespace: sessionNamespace, Labels: map[string]string{"app": "ambient-code-runner", "agentic-session": name}, OwnerReferences: []v1.OwnerReference{{ - APIVersion: "batch/v1", - Kind: "Job", - Name: jobName, - UID: createdJob.UID, + APIVersion: "v1", + Kind: "Pod", + Name: podName, + UID: createdPod.UID, Controller: boolPtr(true), }}, }, Spec: corev1.ServiceSpec{ - Selector: map[string]string{"job-name": jobName}, + Selector: map[string]string{"agentic-session": name, "app": "ambient-code-runner"}, Ports: []corev1.ServicePort{{Port: 8080, TargetPort: intstr.FromString("http"), Protocol: corev1.ProtocolTCP, Name: "http"}}, Type: corev1.ServiceTypeClusterIP, }, } if _, serr := config.K8sClient.CoreV1().Services(sessionNamespace).Create(context.TODO(), svc, v1.CreateOptions{}); serr != nil && !errors.IsAlreadyExists(serr) { - log.Printf("Failed to create per-job content service for %s: %v", name, serr) + log.Printf("Failed to create per-pod content service for %s: %v", name, serr) } // Create AG-UI Service pointing to the runner's FastAPI server @@ -1524,16 +1422,16 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { "agentic-session": name, }, OwnerReferences: []v1.OwnerReference{{ - APIVersion: "batch/v1", - Kind: "Job", - Name: jobName, - UID: createdJob.UID, + APIVersion: "v1", + Kind: "Pod", + Name: podName, + UID: createdPod.UID, Controller: boolPtr(true), }}, }, Spec: corev1.ServiceSpec{ Type: corev1.ServiceTypeClusterIP, - Selector: map[string]string{"job-name": jobName}, + Selector: map[string]string{"agentic-session": name, "app": "ambient-code-runner"}, Ports: []corev1.ServicePort{{ Name: "agui", Protocol: corev1.ProtocolTCP, @@ -1548,17 +1446,17 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { log.Printf("Created AG-UI service session-%s for AgenticSession %s", name, name) } - // Start monitoring the job (only if not already being monitored) - monitorKey := 
fmt.Sprintf("%s/%s", sessionNamespace, jobName) - monitoredJobsMu.Lock() - alreadyMonitoring := monitoredJobs[monitorKey] + // Start monitoring the pod (only if not already being monitored) + monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, podName) + monitoredPodsMu.Lock() + alreadyMonitoring := monitoredPods[monitorKey] if !alreadyMonitoring { - monitoredJobs[monitorKey] = true - monitoredJobsMu.Unlock() - go monitorJob(jobName, name, sessionNamespace) + monitoredPods[monitorKey] = true + monitoredPodsMu.Unlock() + go monitorPod(podName, name, sessionNamespace) } else { - monitoredJobsMu.Unlock() - log.Printf("Job %s already being monitored, skipping duplicate goroutine", jobName) + monitoredPodsMu.Unlock() + log.Printf("Pod %s already being monitored, skipping duplicate goroutine", podName) } return nil @@ -1834,18 +1732,18 @@ func reconcileActiveWorkflowWithPatch(sessionNamespace, sessionName string, spec return nil } -func monitorJob(jobName, sessionName, sessionNamespace string) { - monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, jobName) +func monitorPod(podName, sessionName, sessionNamespace string) { + monitorKey := fmt.Sprintf("%s/%s", sessionNamespace, podName) // Remove from monitoring map when this goroutine exits defer func() { - monitoredJobsMu.Lock() - delete(monitoredJobs, monitorKey) - monitoredJobsMu.Unlock() - log.Printf("Stopped monitoring job %s (goroutine exiting)", jobName) + monitoredPodsMu.Lock() + delete(monitoredPods, monitorKey) + monitoredPodsMu.Unlock() + log.Printf("Stopped monitoring pod %s (goroutine exiting)", podName) }() - log.Printf("Starting job monitoring for %s (session: %s/%s)", jobName, sessionNamespace, sessionName) + log.Printf("Starting pod monitoring for %s (session: %s/%s)", podName, sessionNamespace, sessionName) ticker := time.NewTicker(5 * time.Second) defer ticker.Stop() @@ -1868,7 +1766,7 @@ func monitorJob(jobName, sessionName, sessionNamespace string) { sessionStatus, _, _ := unstructured.NestedMap(sessionObj.Object, "status") if sessionStatus != nil { if currentPhase, ok := sessionStatus["phase"].(string); ok && currentPhase == "Stopped" { - log.Printf("AgenticSession %s was stopped; stopping job monitoring", sessionName) + log.Printf("AgenticSession %s was stopped; stopping pod monitoring", sessionName) return } } @@ -1877,79 +1775,97 @@ func monitorJob(jobName, sessionName, sessionNamespace string) { log.Printf("Failed to refresh runner token for %s/%s: %v", sessionNamespace, sessionName, err) } - job, err := config.K8sClient.BatchV1().Jobs(sessionNamespace).Get(context.TODO(), jobName, v1.GetOptions{}) + pod, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), podName, v1.GetOptions{}) if err != nil { if errors.IsNotFound(err) { - log.Printf("Job %s deleted; stopping monitor", jobName) + log.Printf("Pod %s deleted; stopping monitor", podName) return } - log.Printf("Error fetching job %s: %v", jobName, err) + log.Printf("Error fetching pod %s: %v", podName, err) continue } + // Note: We don't store pod name in status (pods are ephemeral, can be recreated) + // Use k8s-resources endpoint or kubectl for live pod info - pods, err := config.K8sClient.CoreV1().Pods(sessionNamespace).List(context.TODO(), v1.ListOptions{LabelSelector: fmt.Sprintf("job-name=%s", jobName)}) - if err != nil { - log.Printf("Failed to list pods for job %s: %v", jobName, err) - continue + if pod.Spec.NodeName != "" { + statusPatch.AddCondition(conditionUpdate{Type: conditionPodScheduled, Status: "True", Reason: "Scheduled", 
Message: fmt.Sprintf("Scheduled on %s", pod.Spec.NodeName)}) } - if job.Status.Succeeded > 0 { + if pod.Status.Phase == corev1.PodSucceeded { statusPatch.SetField("phase", "Completed") statusPatch.SetField("completionTime", time.Now().UTC().Format(time.RFC3339)) statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: "Completed", Message: "Session finished"}) _ = statusPatch.Apply() _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) + _ = deletePodAndPerPodService(sessionNamespace, podName, sessionName) return } - if job.Spec.BackoffLimit != nil && job.Status.Failed >= *job.Spec.BackoffLimit { - statusPatch.SetField("phase", "Failed") - statusPatch.SetField("completionTime", time.Now().UTC().Format(time.RFC3339)) - statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: "BackoffLimitExceeded", Message: "Runner failed repeatedly"}) - _ = statusPatch.Apply() - _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) - return - } + if pod.Status.Phase == corev1.PodFailed { + // Collect detailed error message from pod and containers + errorMsg := pod.Status.Message + if errorMsg == "" { + errorMsg = pod.Status.Reason + } - if len(pods.Items) == 0 { - if job.Status.Active == 0 && job.Status.Succeeded == 0 && job.Status.Failed == 0 { - statusPatch.SetField("phase", "Failed") - statusPatch.SetField("completionTime", time.Now().UTC().Format(time.RFC3339)) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionReady, - Status: "False", - Reason: "PodMissing", - Message: "Runner pod missing", - }) - _ = statusPatch.Apply() - _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) - return + // Check init containers for errors + for _, initStatus := range pod.Status.InitContainerStatuses { + if initStatus.State.Terminated != nil && initStatus.State.Terminated.ExitCode != 0 { + msg := fmt.Sprintf("Init container %s failed (exit %d): %s", + initStatus.Name, + initStatus.State.Terminated.ExitCode, + initStatus.State.Terminated.Message) + if initStatus.State.Terminated.Reason != "" { + msg = fmt.Sprintf("%s - %s", msg, initStatus.State.Terminated.Reason) + } + errorMsg = msg + break + } + if initStatus.State.Waiting != nil && initStatus.State.Waiting.Reason != "" { + errorMsg = fmt.Sprintf("Init container %s: %s - %s", + initStatus.Name, + initStatus.State.Waiting.Reason, + initStatus.State.Waiting.Message) + break + } } - continue - } - pod := pods.Items[0] - // Note: We don't store pod name in status (pods are ephemeral, can be recreated) - // Use k8s-resources endpoint or kubectl for live pod info + // Check main containers for errors if init passed + if errorMsg == "" || errorMsg == "PodFailed" { + for _, containerStatus := range pod.Status.ContainerStatuses { + if containerStatus.State.Terminated != nil && containerStatus.State.Terminated.ExitCode != 0 { + errorMsg = fmt.Sprintf("Container %s failed (exit %d): %s - %s", + containerStatus.Name, + containerStatus.State.Terminated.ExitCode, + containerStatus.State.Terminated.Reason, + containerStatus.State.Terminated.Message) + break + } + if containerStatus.State.Waiting != nil { + errorMsg = fmt.Sprintf("Container %s: %s - %s", + containerStatus.Name, + containerStatus.State.Waiting.Reason, + containerStatus.State.Waiting.Message) + break + } + } + 
} - if pod.Spec.NodeName != "" { - statusPatch.AddCondition(conditionUpdate{Type: conditionPodScheduled, Status: "True", Reason: "Scheduled", Message: fmt.Sprintf("Scheduled on %s", pod.Spec.NodeName)}) - } + if errorMsg == "" { + errorMsg = "Pod failed with unknown error" + } - if pod.Status.Phase == corev1.PodFailed { + log.Printf("Pod %s failed: %s", podName, errorMsg) statusPatch.SetField("phase", "Failed") statusPatch.SetField("completionTime", time.Now().UTC().Format(time.RFC3339)) - statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: "PodFailed", Message: pod.Status.Message}) + statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: "PodFailed", Message: errorMsg}) _ = statusPatch.Apply() _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) + _ = deletePodAndPerPodService(sessionNamespace, podName, sessionName) return } - runner := getContainerStatusByName(&pod, "ambient-code-runner") + runner := getContainerStatusByName(pod, "ambient-code-runner") if runner == nil { // Apply any accumulated changes (e.g., PodScheduled) before continuing _ = statusPatch.Apply() @@ -1974,7 +1890,7 @@ func monitorJob(jobName, sessionName, sessionNamespace string) { statusPatch.AddCondition(conditionUpdate{Type: conditionReady, Status: "False", Reason: waiting.Reason, Message: msg}) _ = statusPatch.Apply() _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) + _ = deletePodAndPerPodService(sessionNamespace, podName, sessionName) return } } @@ -2008,7 +1924,7 @@ func monitorJob(jobName, sessionName, sessionNamespace string) { _ = statusPatch.Apply() _ = ensureSessionIsInteractive(sessionNamespace, sessionName) - _ = deleteJobAndPerJobService(sessionNamespace, jobName, sessionName) + _ = deletePodAndPerPodService(sessionNamespace, podName, sessionName) return } @@ -2027,31 +1943,101 @@ func getContainerStatusByName(pod *corev1.Pod, name string) *corev1.ContainerSta return nil } +// getS3ConfigForProject reads S3 configuration from project's integration secret +// Falls back to operator defaults if not configured +func getS3ConfigForProject(namespace string, appConfig *config.Config) (endpoint, bucket, accessKey, secretKey string, err error) { + // Try to read from project's ambient-non-vertex-integrations secret + secret, err := config.K8sClient.CoreV1().Secrets(namespace).Get(context.TODO(), "ambient-non-vertex-integrations", v1.GetOptions{}) + if err != nil && !errors.IsNotFound(err) { + return "", "", "", "", fmt.Errorf("failed to read project secret: %w", err) + } + + // Read from project secret if available + storageMode := "shared" // Default to shared cluster storage + if secret != nil && secret.Data != nil { + // Check storage mode (shared vs custom) + if mode := string(secret.Data["STORAGE_MODE"]); mode != "" { + storageMode = mode + } + + // Only read custom S3 settings if in custom mode + if storageMode == "custom" { + if val := string(secret.Data["S3_ENDPOINT"]); val != "" { + endpoint = val + } + if val := string(secret.Data["S3_BUCKET"]); val != "" { + bucket = val + } + if val := string(secret.Data["S3_ACCESS_KEY"]); val != "" { + accessKey = val + } + if val := string(secret.Data["S3_SECRET_KEY"]); val != "" { + secretKey = val + } + log.Printf("Using custom S3 configuration for project %s", namespace) + } else { + log.Printf("Using shared cluster storage (MinIO) 
for project %s", namespace) + } + } + + // Use operator defaults (for shared mode or as fallback) + if endpoint == "" { + endpoint = appConfig.S3Endpoint + } + if bucket == "" { + bucket = appConfig.S3Bucket + } + + // If credentials still empty AND using default endpoint/bucket, use shared MinIO credentials + // This implements "shared cluster storage" mode where users don't need to configure anything + usingDefaults := endpoint == appConfig.S3Endpoint && bucket == appConfig.S3Bucket + if (accessKey == "" || secretKey == "") && usingDefaults { + // Look for minio-credentials secret in operator namespace + minioSecret, err := config.K8sClient.CoreV1().Secrets(appConfig.BackendNamespace).Get(context.TODO(), "minio-credentials", v1.GetOptions{}) + if err == nil && minioSecret.Data != nil { + if accessKey == "" { + accessKey = string(minioSecret.Data["access-key"]) + } + if secretKey == "" { + secretKey = string(minioSecret.Data["secret-key"]) + } + log.Printf("Using shared MinIO credentials for project %s (shared cluster storage mode)", namespace) + } else { + log.Printf("Warning: minio-credentials secret not found in namespace %s", appConfig.BackendNamespace) + } + } + + // Validate we have required config + if endpoint == "" || bucket == "" { + return "", "", "", "", fmt.Errorf("incomplete S3 configuration - endpoint and bucket required") + } + if accessKey == "" || secretKey == "" { + return "", "", "", "", fmt.Errorf("incomplete S3 configuration - access key and secret key required") + } + + log.Printf("S3 config for project %s: endpoint=%s, bucket=%s", namespace, endpoint, bucket) + return endpoint, bucket, accessKey, secretKey, nil +} + // deleteJobAndPerJobService deletes the Job and its associated per-job Service -func deleteJobAndPerJobService(namespace, jobName, sessionName string) error { - // Delete Service first (it has ownerRef to Job, but delete explicitly just in case) +func deletePodAndPerPodService(namespace, podName, sessionName string) error { + // Delete Service first (it has ownerRef to Pod, but delete explicitly just in case) svcName := fmt.Sprintf("ambient-content-%s", sessionName) if err := config.K8sClient.CoreV1().Services(namespace).Delete(context.TODO(), svcName, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete per-job service %s/%s: %v", namespace, svcName, err) + log.Printf("Failed to delete per-pod service %s/%s: %v", namespace, svcName, err) } - // Delete the Job with background propagation - policy := v1.DeletePropagationBackground - if err := config.K8sClient.BatchV1().Jobs(namespace).Delete(context.TODO(), jobName, v1.DeleteOptions{PropagationPolicy: &policy}); err != nil && !errors.IsNotFound(err) { - log.Printf("Failed to delete job %s/%s: %v", namespace, jobName, err) - return err + // Delete AG-UI service + aguiSvcName := fmt.Sprintf("session-%s", sessionName) + if err := config.K8sClient.CoreV1().Services(namespace).Delete(context.TODO(), aguiSvcName, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { + log.Printf("Failed to delete AG-UI service %s/%s: %v", namespace, aguiSvcName, err) } - // Proactively delete Pods for this Job - if pods, err := config.K8sClient.CoreV1().Pods(namespace).List(context.TODO(), v1.ListOptions{LabelSelector: fmt.Sprintf("job-name=%s", jobName)}); err == nil { - for i := range pods.Items { - p := pods.Items[i] - if err := config.K8sClient.CoreV1().Pods(namespace).Delete(context.TODO(), p.Name, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { - 
log.Printf("Failed to delete pod %s/%s for job %s: %v", namespace, p.Name, jobName, err) - } - } - } else if !errors.IsNotFound(err) { - log.Printf("Failed to list pods for job %s/%s: %v", namespace, jobName, err) + // Delete the Pod with background propagation + policy := v1.DeletePropagationBackground + if err := config.K8sClient.CoreV1().Pods(namespace).Delete(context.TODO(), podName, v1.DeleteOptions{PropagationPolicy: &policy}); err != nil && !errors.IsNotFound(err) { + log.Printf("Failed to delete pod %s/%s: %v", namespace, podName, err) + return err } // Delete the ambient-vertex secret if it was copied by the operator @@ -2076,90 +2062,6 @@ func deleteJobAndPerJobService(namespace, jobName, sessionName string) error { return nil } -// CleanupExpiredTempContentPods removes temporary content pods that have exceeded their TTL -func CleanupExpiredTempContentPods() { - log.Println("Starting temp content pod cleanup goroutine") - for { - time.Sleep(1 * time.Minute) - - // List all temp content pods across all namespaces - pods, err := config.K8sClient.CoreV1().Pods("").List(context.TODO(), v1.ListOptions{ - LabelSelector: "app=temp-content-service", - }) - if err != nil { - log.Printf("[TempPodCleanup] Failed to list temp content pods: %v", err) - continue - } - - gvr := types.GetAgenticSessionResource() - for _, pod := range pods.Items { - sessionName := pod.Labels["agentic-session"] - if sessionName == "" { - log.Printf("[TempPodCleanup] Temp pod %s has no agentic-session label, skipping", pod.Name) - continue - } - - // Check if session still exists - session, err := config.DynamicClient.Resource(gvr).Namespace(pod.Namespace).Get(context.TODO(), sessionName, v1.GetOptions{}) - if err != nil { - if errors.IsNotFound(err) { - // Session deleted, delete temp pod - log.Printf("[TempPodCleanup] Session %s/%s gone, deleting orphaned temp pod %s", pod.Namespace, sessionName, pod.Name) - if err := config.K8sClient.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { - log.Printf("[TempPodCleanup] Failed to delete orphaned temp pod: %v", err) - } - } - continue - } - - // Get last-accessed timestamp from session annotation - annotations := session.GetAnnotations() - lastAccessedStr := annotations[tempContentLastAccessedAnnotation] - if lastAccessedStr == "" { - // Fall back to pod created-at if no last-accessed - lastAccessedStr = pod.Annotations["ambient-code.io/created-at"] - } - - if lastAccessedStr == "" { - log.Printf("[TempPodCleanup] No timestamp for temp pod %s, skipping", pod.Name) - continue - } - - lastAccessed, err := time.Parse(time.RFC3339, lastAccessedStr) - if err != nil { - log.Printf("[TempPodCleanup] Failed to parse timestamp for pod %s: %v", pod.Name, err) - continue - } - - // Delete if inactive for > 10 minutes - if time.Since(lastAccessed) > tempContentInactivityTTL { - log.Printf("[TempPodCleanup] Deleting inactive temp pod %s/%s (last accessed: %v ago)", - pod.Namespace, pod.Name, time.Since(lastAccessed)) - - if err := config.K8sClient.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, v1.DeleteOptions{}); err != nil && !errors.IsNotFound(err) { - log.Printf("[TempPodCleanup] Failed to delete temp pod: %v", err) - continue - } - - // Update condition - _ = mutateAgenticSessionStatus(pod.Namespace, sessionName, func(status map[string]interface{}) { - setCondition(status, conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "False", - Reason: "Expired", - Message: 
fmt.Sprintf("Temp pod deleted due to inactivity (%v)", time.Since(lastAccessed)), - }) - }) - - // Clear temp-content-requested annotation - delete(annotations, tempContentRequestedAnnotation) - delete(annotations, tempContentLastAccessedAnnotation) - _ = updateAnnotations(pod.Namespace, sessionName, annotations) - } - } - } -} - // copySecretToNamespace copies a secret to a target namespace with owner references func copySecretToNamespace(ctx context.Context, sourceSecret *corev1.Secret, targetNamespace string, ownerObj *unstructured.Unstructured) error { // Check if secret already exists in target namespace @@ -2326,137 +2228,6 @@ func deleteAmbientLangfuseSecret(ctx context.Context, namespace string) error { return nil } -// reconcileTempContentPodWithPatch is a version of reconcileTempContentPod that uses StatusPatch for batched updates. -func reconcileTempContentPodWithPatch(sessionNamespace, sessionName, tempPodName string, session *unstructured.Unstructured, statusPatch *StatusPatch) error { - // Check if pod already exists - tempPod, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Get(context.TODO(), tempPodName, v1.GetOptions{}) - - if errors.IsNotFound(err) { - // Create temp pod - log.Printf("[TempPod] Creating temp content pod for workspace access: %s/%s", sessionNamespace, tempPodName) - - pvcName := fmt.Sprintf("ambient-workspace-%s", sessionName) - appConfig := config.LoadConfig() - - pod := &corev1.Pod{ - ObjectMeta: v1.ObjectMeta{ - Name: tempPodName, - Namespace: sessionNamespace, - Labels: map[string]string{ - "app": "temp-content-service", - "agentic-session": sessionName, - }, - Annotations: map[string]string{ - "ambient-code.io/created-at": time.Now().UTC().Format(time.RFC3339), - }, - OwnerReferences: []v1.OwnerReference{{ - APIVersion: session.GetAPIVersion(), - Kind: session.GetKind(), - Name: session.GetName(), - UID: session.GetUID(), - Controller: boolPtr(true), - }}, - }, - Spec: corev1.PodSpec{ - RestartPolicy: corev1.RestartPolicyNever, - TerminationGracePeriodSeconds: int64Ptr(0), // Enable instant termination - Containers: []corev1.Container{{ - Name: "content", - Image: appConfig.ContentServiceImage, - ImagePullPolicy: appConfig.ImagePullPolicy, - Env: []corev1.EnvVar{ - {Name: "CONTENT_SERVICE_MODE", Value: "true"}, - {Name: "STATE_BASE_DIR", Value: "/workspace"}, - }, - Ports: []corev1.ContainerPort{{ContainerPort: 8080, Name: "http"}}, - VolumeMounts: []corev1.VolumeMount{{ - Name: "workspace", - MountPath: "/workspace", - }}, - ReadinessProbe: &corev1.Probe{ - ProbeHandler: corev1.ProbeHandler{ - HTTPGet: &corev1.HTTPGetAction{ - Path: "/health", - Port: intstr.FromString("http"), - }, - }, - InitialDelaySeconds: 3, - PeriodSeconds: 3, - }, - }}, - Volumes: []corev1.Volume{{ - Name: "workspace", - VolumeSource: corev1.VolumeSource{ - PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ - ClaimName: pvcName, - }, - }, - }}, - }, - } - - if _, err := config.K8sClient.CoreV1().Pods(sessionNamespace).Create(context.TODO(), pod, v1.CreateOptions{}); err != nil { - log.Printf("[TempPod] Failed to create temp pod: %v", err) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "False", - Reason: "CreationFailed", - Message: fmt.Sprintf("Failed to create temp pod: %v", err), - }) - return fmt.Errorf("failed to create temp pod: %w", err) - } - - log.Printf("[TempPod] Created temp pod %s", tempPodName) - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "Unknown", 
- Reason: "Provisioning", - Message: "Temp content pod starting", - }) - return nil - } - - if err != nil { - return fmt.Errorf("failed to check temp pod: %w", err) - } - - // Temp pod exists, check readiness - if tempPod.Status.Phase == corev1.PodRunning { - ready := false - for _, cond := range tempPod.Status.Conditions { - if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue { - ready = true - break - } - } - - if ready { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "True", - Reason: "Ready", - Message: "Temp content pod is ready for workspace access", - }) - } else { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "Unknown", - Reason: "NotReady", - Message: "Temp content pod not ready yet", - }) - } - } else if tempPod.Status.Phase == corev1.PodFailed { - statusPatch.AddCondition(conditionUpdate{ - Type: conditionTempContentPodReady, - Status: "False", - Reason: "PodFailed", - Message: fmt.Sprintf("Temp content pod failed: %s", tempPod.Status.Message), - }) - } - - return nil -} - // LEGACY: getBackendAPIURL removed - AG-UI migration // Workflow and repo changes now call runner's REST endpoints directly @@ -2632,6 +2403,5 @@ func regenerateRunnerToken(sessionNamespace, sessionName string, session *unstru // Helper functions var ( boolPtr = func(b bool) *bool { return &b } - int32Ptr = func(i int32) *int32 { return &i } int64Ptr = func(i int64) *int64 { return &i } ) diff --git a/components/operator/internal/services/infrastructure.go b/components/operator/internal/services/infrastructure.go index bed30920a..e33481f89 100644 --- a/components/operator/internal/services/infrastructure.go +++ b/components/operator/internal/services/infrastructure.go @@ -51,36 +51,10 @@ func EnsureContentService(namespace string) error { return nil } -// EnsureSessionWorkspacePVC creates a per-session PVC owned by the AgenticSession to avoid multi-attach conflicts +// EnsureSessionWorkspacePVC is deprecated - sessions now use EmptyDir with S3 state persistence +// Kept for backward compatibility but returns nil immediately func EnsureSessionWorkspacePVC(namespace, pvcName string, ownerRefs []v1.OwnerReference) error { - // Check if PVC exists - if _, err := config.K8sClient.CoreV1().PersistentVolumeClaims(namespace).Get(context.TODO(), pvcName, v1.GetOptions{}); err == nil { - return nil - } else if !errors.IsNotFound(err) { - return err - } - - pvc := &corev1.PersistentVolumeClaim{ - ObjectMeta: v1.ObjectMeta{ - Name: pvcName, - Namespace: namespace, - Labels: map[string]string{"app": "ambient-workspace", "agentic-session": pvcName}, - OwnerReferences: ownerRefs, - }, - Spec: corev1.PersistentVolumeClaimSpec{ - AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce}, - Resources: corev1.VolumeResourceRequirements{ - Requests: corev1.ResourceList{ - corev1.ResourceStorage: resource.MustParse("5Gi"), - }, - }, - }, - } - if _, err := config.K8sClient.CoreV1().PersistentVolumeClaims(namespace).Create(context.TODO(), pvc, v1.CreateOptions{}); err != nil { - if errors.IsAlreadyExists(err) { - return nil - } - return err - } + // DEPRECATED: Per-session PVCs have been replaced with EmptyDir + S3 state sync + // This function is kept for backward compatibility but does nothing return nil } diff --git a/components/operator/main.go b/components/operator/main.go index df9c31821..c71c12709 100644 --- a/components/operator/main.go +++ b/components/operator/main.go @@ -1,16 +1,28 @@ package main 
import ( + "context" + "flag" "log" "os" + "strconv" + + "k8s.io/apimachinery/pkg/runtime" + utilruntime "k8s.io/apimachinery/pkg/util/runtime" + clientgoscheme "k8s.io/client-go/kubernetes/scheme" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/healthz" + ctrllog "sigs.k8s.io/controller-runtime/pkg/log" + "sigs.k8s.io/controller-runtime/pkg/log/zap" + metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server" "ambient-code-operator/internal/config" + "ambient-code-operator/internal/controller" "ambient-code-operator/internal/handlers" "ambient-code-operator/internal/preflight" ) // Build-time metadata (set via -ldflags -X during build) -// These are embedded directly in the binary, so they're always accurate var ( GitCommit = "unknown" GitBranch = "unknown" @@ -18,49 +30,157 @@ var ( BuildDate = "unknown" ) -func logBuildInfo() { - log.Println("==============================================") - log.Println("Agentic Session Operator - Build Information") - log.Println("==============================================") - log.Printf("Version: %s", GitVersion) - log.Printf("Commit: %s", GitCommit) - log.Printf("Branch: %s", GitBranch) - log.Printf("Repository: %s", getEnvOrDefault("GIT_REPO", "unknown")) - log.Printf("Built: %s", BuildDate) - log.Printf("Built by: %s", getEnvOrDefault("BUILD_USER", "unknown")) - log.Println("==============================================") -} +var ( + scheme = runtime.NewScheme() +) -func getEnvOrDefault(key, defaultValue string) string { - if value := os.Getenv(key); value != "" { - return value - } - return defaultValue +func init() { + utilruntime.Must(clientgoscheme.AddToScheme(scheme)) } func main() { + // Parse command line flags + var metricsAddr string + var enableLeaderElection bool + var probeAddr string + var maxConcurrentReconciles int + var useLegacyWatch bool + + flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.") + flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.") + flag.BoolVar(&enableLeaderElection, "leader-elect", false, + "Enable leader election for controller manager. "+ + "Enabling this will ensure there is only one active controller manager.") + flag.IntVar(&maxConcurrentReconciles, "max-concurrent-reconciles", 10, + "Maximum number of concurrent Reconciles which can be run. 
Higher values allow more throughput but consume more resources.") + flag.BoolVar(&useLegacyWatch, "legacy-watch", false, + "Use legacy watch-based implementation instead of controller-runtime (for debugging only).") + flag.Parse() + + // Allow environment variable override for max concurrent reconciles + if envVal := os.Getenv("MAX_CONCURRENT_RECONCILES"); envVal != "" { + if v, err := strconv.Atoi(envVal); err == nil && v > 0 { + maxConcurrentReconciles = v + } + } + + // Set up logging + opts := zap.Options{ + Development: os.Getenv("DEV_MODE") == "true", + } + ctrllog.SetLogger(zap.New(zap.UseFlagOptions(&opts))) + + logger := ctrllog.Log.WithName("setup") + // Log build information logBuildInfo() + logger.Info("Starting Agentic Session Operator", + "maxConcurrentReconciles", maxConcurrentReconciles, + "leaderElection", enableLeaderElection, + "legacyWatch", useLegacyWatch, + ) - // Initialize Kubernetes clients + // Initialize Kubernetes clients (needed for legacy handlers and config) if err := config.InitK8sClients(); err != nil { - log.Fatalf("Failed to initialize Kubernetes clients: %v", err) + logger.Error(err, "Failed to initialize Kubernetes clients") + os.Exit(1) } // Load application configuration appConfig := config.LoadConfig() - log.Printf("Agentic Session Operator starting in namespace: %s", appConfig.Namespace) - log.Printf("Using ambient-code runner image: %s", appConfig.AmbientCodeRunnerImage) + logger.Info("Configuration loaded", + "namespace", appConfig.Namespace, + "backendNamespace", appConfig.BackendNamespace, + "runnerImage", appConfig.AmbientCodeRunnerImage, + ) + + // Initialize OpenTelemetry metrics + shutdownMetrics, err := controller.InitMetrics(context.Background()) + if err != nil { + logger.Error(err, "Failed to initialize OpenTelemetry metrics, continuing without metrics") + } else { + defer shutdownMetrics() + } // Validate Vertex AI configuration at startup if enabled if os.Getenv("CLAUDE_CODE_USE_VERTEX") == "1" { if err := preflight.ValidateVertexConfig(appConfig.Namespace); err != nil { - log.Fatalf("Vertex AI validation failed: %v", err) + logger.Error(err, "Vertex AI validation failed") + os.Exit(1) } } - // Start watching AgenticSession resources + // If legacy watch mode is requested, use the old implementation + if useLegacyWatch { + logger.Info("Using legacy watch-based implementation") + runLegacyMode() + return + } + + // Create controller-runtime manager with increased QPS/Burst to avoid client-side throttling + // Default is QPS=5, Burst=10 which causes delays when handling multiple sessions + restConfig := ctrl.GetConfigOrDie() + restConfig.QPS = 100 + restConfig.Burst = 200 + + mgr, err := ctrl.NewManager(restConfig, ctrl.Options{ + Scheme: scheme, + Metrics: metricsserver.Options{BindAddress: metricsAddr}, + HealthProbeBindAddress: probeAddr, + LeaderElection: enableLeaderElection, + LeaderElectionID: "ambient-code-operator.ambient-code.io", + }) + if err != nil { + logger.Error(err, "Unable to create manager") + os.Exit(1) + } + + // Set up AgenticSession controller with concurrent reconcilers + agenticSessionReconciler := controller.NewAgenticSessionReconciler( + mgr.GetClient(), + maxConcurrentReconciles, + ) + if err := agenticSessionReconciler.SetupWithManager(mgr); err != nil { + logger.Error(err, "Unable to create AgenticSession controller") + os.Exit(1) + } + logger.Info("AgenticSession controller registered", + "maxConcurrentReconciles", maxConcurrentReconciles, + ) + + // Add health check endpoints + if err := 
mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil { + logger.Error(err, "Unable to set up health check") + os.Exit(1) + } + if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil { + logger.Error(err, "Unable to set up ready check") + os.Exit(1) + } + + // Start namespace and project settings watchers (these remain as watch loops for now) + // Note: These could be migrated to controller-runtime controllers in the future + go handlers.WatchNamespaces() + go handlers.WatchProjectSettings() + + logger.Info("Starting manager with controller-runtime", + "maxConcurrentReconciles", maxConcurrentReconciles, + ) + + // Start the manager (blocks until stopped) + if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil { + logger.Error(err, "Problem running manager") + os.Exit(1) + } +} + +// runLegacyMode runs the operator using the old watch-based implementation. +// This is kept for backward compatibility and debugging. +func runLegacyMode() { + log.Println("=== LEGACY MODE: Using watch-based implementation ===") + + // Start watching AgenticSession resources (legacy) go handlers.WatchAgenticSessions() // Start watching for managed namespaces @@ -69,9 +189,26 @@ func main() { // Start watching ProjectSettings resources go handlers.WatchProjectSettings() - // Start cleanup of expired temporary content pods - go handlers.CleanupExpiredTempContentPods() - // Keep the operator running select {} } + +func logBuildInfo() { + log.Println("==============================================") + log.Println("Agentic Session Operator - Build Information") + log.Println("==============================================") + log.Printf("Version: %s", GitVersion) + log.Printf("Commit: %s", GitCommit) + log.Printf("Branch: %s", GitBranch) + log.Printf("Repository: %s", getEnvOrDefault("GIT_REPO", "unknown")) + log.Printf("Built: %s", BuildDate) + log.Printf("Built by: %s", getEnvOrDefault("BUILD_USER", "unknown")) + log.Println("==============================================") +} + +func getEnvOrDefault(key, defaultValue string) string { + if value := os.Getenv(key); value != "" { + return value + } + return defaultValue +} diff --git a/components/runners/claude-code-runner/adapter.py b/components/runners/claude-code-runner/adapter.py index 419e493d2..ad8002f23 100644 --- a/components/runners/claude-code-runner/adapter.py +++ b/components/runners/claude-code-runner/adapter.py @@ -87,13 +87,11 @@ async def initialize(self, context: RunnerContext): # Copy Google OAuth credentials from mounted Secret to writable workspace location await self._setup_google_credentials() - # Prepare workspace from input repo if provided - async for event in self._prepare_workspace(): - yield event - - # Initialize workflow if ACTIVE_WORKFLOW env vars are set - async for event in self._initialize_workflow_if_set(): - yield event + # Workspace is already prepared by init container (hydrate.sh) + # - Repos cloned to /workspace/repos/ + # - Workflows cloned to /workspace/workflows/ + # - State hydrated from S3 to .claude/, artifacts/, file-uploads/ + logger.info("Workspace prepared by init container, validating...") # Validate prerequisite files exist for phase-based commands try: @@ -361,9 +359,11 @@ async def _run_claude_agent_sdk( ) obs._pending_initial_prompt = prompt - # Check if continuing from previous session - parent_session_id = self.context.get_env('PARENT_SESSION_ID', '').strip() - is_continuation = bool(parent_session_id) + # Check if this is a resume session via IS_RESUME env var + # This is set by the operator 
when restarting a stopped/completed/failed session + is_continuation = self.context.get_env('IS_RESUME', '').strip().lower() == 'true' + if is_continuation: + logger.info("IS_RESUME=true - treating as continuation") # Determine cwd and additional dirs repos_cfg = self._get_repos_config() @@ -898,160 +898,34 @@ async def _setup_vertex_credentials(self) -> dict: } async def _prepare_workspace(self) -> AsyncIterator[BaseEvent]: - """Clone input repo/branch into workspace and configure git remotes.""" + """Validate workspace prepared by init container. + + The init-hydrate container now handles: + - Downloading state from S3 (.claude/, artifacts/, file-uploads/) + - Cloning repos to /workspace/repos/ + - Cloning workflows to /workspace/workflows/ + + Runner just validates and logs what's ready. + """ workspace = Path(self.context.workspace_path) - workspace.mkdir(parents=True, exist_ok=True) - - parent_session_id = self.context.get_env('PARENT_SESSION_ID', '').strip() - reusing_workspace = bool(parent_session_id) - - logger.info(f"Workspace preparation: parent_session_id={parent_session_id[:8] if parent_session_id else 'None'}, reusing={reusing_workspace}") - - repos_cfg = self._get_repos_config() - if repos_cfg: - async for event in self._prepare_multi_repo_workspace(workspace, repos_cfg, reusing_workspace): - yield event - return - - # Single-repo legacy flow - input_repo = os.getenv("INPUT_REPO_URL", "").strip() - if not input_repo: - logger.info("No INPUT_REPO_URL configured, skipping single-repo setup") - return - - input_branch = os.getenv("INPUT_BRANCH", "").strip() or "main" - output_repo = os.getenv("OUTPUT_REPO_URL", "").strip() - - token = await self._fetch_token_for_url(input_repo) - workspace_has_git = (workspace / ".git").exists() - - try: - if not workspace_has_git: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": "📥 Cloning input repository..."} - ) - clone_url = self._url_with_token(input_repo, token) if token else input_repo - await self._run_cmd(["git", "clone", "--branch", input_branch, "--single-branch", clone_url, str(workspace)], cwd=str(workspace.parent)) - await self._run_cmd(["git", "remote", "set-url", "origin", clone_url], cwd=str(workspace), ignore_errors=True) - elif reusing_workspace: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": "✓ Preserving workspace (continuation)"} - ) - await self._run_cmd(["git", "remote", "set-url", "origin", self._url_with_token(input_repo, token) if token else input_repo], cwd=str(workspace), ignore_errors=True) - else: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": "🔄 Resetting workspace to clean state"} - ) - await self._run_cmd(["git", "remote", "set-url", "origin", self._url_with_token(input_repo, token) if token else input_repo], cwd=str(workspace)) - await self._run_cmd(["git", "fetch", "origin", input_branch], cwd=str(workspace)) - await self._run_cmd(["git", "checkout", input_branch], cwd=str(workspace)) - await self._run_cmd(["git", "reset", "--hard", f"origin/{input_branch}"], cwd=str(workspace)) - - # Git identity - user_name = os.getenv("GIT_USER_NAME", "").strip() or "Ambient Code Bot" - user_email = 
os.getenv("GIT_USER_EMAIL", "").strip() or "bot@ambient-code.local" - await self._run_cmd(["git", "config", "user.name", user_name], cwd=str(workspace)) - await self._run_cmd(["git", "config", "user.email", user_email], cwd=str(workspace)) - - if output_repo: - out_url = self._url_with_token(output_repo, token) if token else output_repo - await self._run_cmd(["git", "remote", "remove", "output"], cwd=str(workspace), ignore_errors=True) - await self._run_cmd(["git", "remote", "add", "output", out_url], cwd=str(workspace)) - - except Exception as e: - logger.error(f"Failed to prepare workspace: {e}") - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"Workspace preparation failed: {e}"} - ) - - # Create artifacts directory - try: - artifacts_dir = workspace / "artifacts" - artifacts_dir.mkdir(parents=True, exist_ok=True) - except Exception as e: - logger.warning(f"Failed to create artifacts directory: {e}") - - async def _prepare_multi_repo_workspace( - self, workspace: Path, repos_cfg: list, reusing_workspace: bool - ) -> AsyncIterator[BaseEvent]: - """Prepare workspace for multi-repo mode.""" - try: - for r in repos_cfg: - name = (r.get('name') or '').strip() - inp = r.get('input') or {} - url = (inp.get('url') or '').strip() - branch = (inp.get('branch') or '').strip() or 'main' - if not name or not url: - continue - - repo_dir = workspace / name - token = await self._fetch_token_for_url(url) - repo_exists = repo_dir.exists() and (repo_dir / ".git").exists() - - if not repo_exists: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"📥 Cloning {name}..."} - ) - clone_url = self._url_with_token(url, token) if token else url - await self._run_cmd(["git", "clone", "--branch", branch, "--single-branch", clone_url, str(repo_dir)], cwd=str(workspace)) - await self._run_cmd(["git", "remote", "set-url", "origin", clone_url], cwd=str(repo_dir), ignore_errors=True) - elif reusing_workspace: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"✓ Preserving {name} (continuation)"} - ) - await self._run_cmd(["git", "remote", "set-url", "origin", self._url_with_token(url, token) if token else url], cwd=str(repo_dir), ignore_errors=True) - else: - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"🔄 Resetting {name} to clean state"} - ) - await self._run_cmd(["git", "remote", "set-url", "origin", self._url_with_token(url, token) if token else url], cwd=str(repo_dir), ignore_errors=True) - await self._run_cmd(["git", "fetch", "origin", branch], cwd=str(repo_dir)) - await self._run_cmd(["git", "checkout", branch], cwd=str(repo_dir)) - await self._run_cmd(["git", "reset", "--hard", f"origin/{branch}"], cwd=str(repo_dir)) - - # Git identity - user_name = os.getenv("GIT_USER_NAME", "").strip() or "Ambient Code Bot" - user_email = os.getenv("GIT_USER_EMAIL", "").strip() or "bot@ambient-code.local" - await self._run_cmd(["git", "config", "user.name", user_name], cwd=str(repo_dir)) - await self._run_cmd(["git", "config", "user.email", user_email], cwd=str(repo_dir)) - - # 
Configure output remote - out = r.get('output') or {} - out_url_raw = (out.get('url') or '').strip() - if out_url_raw: - out_url = self._url_with_token(out_url_raw, token) if token else out_url_raw - await self._run_cmd(["git", "remote", "remove", "output"], cwd=str(repo_dir), ignore_errors=True) - await self._run_cmd(["git", "remote", "add", "output", out_url], cwd=str(repo_dir)) + logger.info(f"Validating workspace at {workspace}") + + # Check what was hydrated + hydrated_paths = [] + for path_name in [".claude", "artifacts", "file-uploads"]: + path_dir = workspace / path_name + if path_dir.exists(): + file_count = len([f for f in path_dir.rglob("*") if f.is_file()]) + if file_count > 0: + hydrated_paths.append(f"{path_name} ({file_count} files)") + + if hydrated_paths: + logger.info(f"Hydrated from S3: {', '.join(hydrated_paths)}") + else: + logger.info("No state hydrated (fresh session)") + + # No further preparation needed - init container did the work - except Exception as e: - logger.error(f"Failed to prepare multi-repo workspace: {e}") - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"Workspace preparation failed: {e}"} - ) async def _validate_prerequisites(self): """Validate prerequisite files exist for phase-based slash commands.""" @@ -1086,14 +960,11 @@ async def _validate_prerequisites(self): break async def _initialize_workflow_if_set(self) -> AsyncIterator[BaseEvent]: - """Initialize workflow on startup if ACTIVE_WORKFLOW env vars are set.""" + """Validate workflow was cloned by init container.""" active_workflow_url = (os.getenv('ACTIVE_WORKFLOW_GIT_URL') or '').strip() if not active_workflow_url: return - active_workflow_branch = (os.getenv('ACTIVE_WORKFLOW_BRANCH') or 'main').strip() - active_workflow_path = (os.getenv('ACTIVE_WORKFLOW_PATH') or '').strip() - try: owner, repo, _ = self._parse_owner_repo(active_workflow_url) derived_name = repo or '' @@ -1105,79 +976,24 @@ async def _initialize_workflow_if_set(self) -> AsyncIterator[BaseEvent]: derived_name = (derived_name or '').removesuffix('.git').strip() if not derived_name: - logger.warning("Could not derive workflow name from URL, skipping initialization") + logger.warning("Could not derive workflow name from URL") return - workflow_dir = Path(self.context.workspace_path) / "workflows" / derived_name - - if workflow_dir.exists(): - logger.info(f"Workflow {derived_name} already exists, skipping initialization") - return - - logger.info(f"Initializing workflow {derived_name} from CR spec on startup") - async for event in self._clone_workflow_repository(active_workflow_url, active_workflow_branch, active_workflow_path, derived_name): - yield event + # Check for cloned workflow (init container uses -clone-temp suffix) + workspace = Path(self.context.workspace_path) + workflow_temp_dir = workspace / "workflows" / f"{derived_name}-clone-temp" + workflow_dir = workspace / "workflows" / derived_name + + if workflow_temp_dir.exists(): + logger.info(f"Workflow {derived_name} cloned by init container at {workflow_temp_dir.name}") + elif workflow_dir.exists(): + logger.info(f"Workflow {derived_name} available at {workflow_dir.name}") + else: + logger.warning(f"Workflow {derived_name} not found (init container may have failed to clone)") except Exception as e: - logger.error(f"Failed to initialize workflow on startup: {e}") + logger.error(f"Failed to validate workflow: {e}") - async def 
_clone_workflow_repository( - self, git_url: str, branch: str, path: str, workflow_name: str - ) -> AsyncIterator[BaseEvent]: - """Clone workflow repository.""" - workspace = Path(self.context.workspace_path) - workflow_dir = workspace / "workflows" / workflow_name - temp_clone_dir = workspace / "workflows" / f"{workflow_name}-clone-temp" - - if workflow_dir.exists(): - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"✓ Workflow {workflow_name} already loaded"} - ) - return - - token = await self._fetch_token_for_url(git_url) - - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"📥 Cloning workflow {workflow_name}..."} - ) - - clone_url = self._url_with_token(git_url, token) if token else git_url - await self._run_cmd(["git", "clone", "--branch", branch, "--single-branch", clone_url, str(temp_clone_dir)], cwd=str(workspace)) - - if path and path.strip(): - subdir_path = temp_clone_dir / path.strip() - if subdir_path.exists() and subdir_path.is_dir(): - shutil.copytree(subdir_path, workflow_dir) - shutil.rmtree(temp_clone_dir) - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"✓ Extracted workflow from: {path}"} - ) - else: - temp_clone_dir.rename(workflow_dir) - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"⚠️ Path '{path}' not found, using full repository"} - ) - else: - temp_clone_dir.rename(workflow_dir) - - yield RawEvent( - type=EventType.RAW, - thread_id=self._current_thread_id or self.context.session_id, - run_id=self._current_run_id or "init", - event={"type": "system_log", "message": f"✅ Workflow {workflow_name} ready"} - ) async def _run_cmd(self, cmd, cwd=None, capture_stdout=False, ignore_errors=False): """Run a subprocess command asynchronously.""" diff --git a/components/runners/claude-code-runner/main.py b/components/runners/claude-code-runner/main.py index 7f14b1663..afbbefaed 100644 --- a/components/runners/claude-code-runner/main.py +++ b/components/runners/claude-code-runner/main.py @@ -97,17 +97,20 @@ async def lifespan(app: FastAPI): logger.info("Adapter initialized - fresh client will be created for each run") - # Check if this is a continuation (has parent session) - # PARENT_SESSION_ID is set when continuing from another session - parent_session_id = os.getenv("PARENT_SESSION_ID", "").strip() + # Check if this is a resume session via IS_RESUME env var + # This is set by the operator when restarting a stopped/completed/failed session + is_resume = os.getenv("IS_RESUME", "").strip().lower() == "true" + if is_resume: + logger.info("IS_RESUME=true - this is a resumed session, will skip INITIAL_PROMPT") - # Check for INITIAL_PROMPT and auto-execute (only if no parent session) + # Check for INITIAL_PROMPT and auto-execute (only if not a resume) initial_prompt = os.getenv("INITIAL_PROMPT", "").strip() - if initial_prompt and not parent_session_id: - logger.info(f"INITIAL_PROMPT detected ({len(initial_prompt)} chars), will auto-execute after 3s delay") + if initial_prompt and not is_resume: + delay = 
os.getenv("INITIAL_PROMPT_DELAY_SECONDS", "1") + logger.info(f"INITIAL_PROMPT detected ({len(initial_prompt)} chars), will auto-execute after {delay}s delay") asyncio.create_task(auto_execute_initial_prompt(initial_prompt, session_id)) - elif initial_prompt: - logger.info(f"INITIAL_PROMPT detected but has parent session ({parent_session_id[:12]}...) - skipping") + elif initial_prompt and is_resume: + logger.info("INITIAL_PROMPT detected but IS_RESUME=true - skipping (this is a resume)") logger.info(f"AG-UI server ready for session {session_id}") @@ -120,17 +123,19 @@ async def lifespan(app: FastAPI): async def auto_execute_initial_prompt(prompt: str, session_id: str): """Auto-execute INITIAL_PROMPT by POSTing to backend after short delay. - The 3-second delay gives the runner time to fully start. Backend has retry - logic to handle if Service DNS isn't ready yet. + The delay gives the runner service time to register in DNS. Backend has retry + logic to handle if Service DNS isn't ready yet, so this can be short. - Only called for fresh sessions (no PARENT_SESSION_ID set). + Only called for fresh sessions (no hydrated state in .claude/). """ import uuid import aiohttp - # Give runner time to fully start before backend tries to reach us - logger.info("Waiting 3s before auto-executing INITIAL_PROMPT (allow Service DNS to propagate)...") - await asyncio.sleep(3) + # Configurable delay (default 1s, was 3s) + # Backend has retry logic, so we don't need to wait long + delay_seconds = float(os.getenv("INITIAL_PROMPT_DELAY_SECONDS", "1")) + logger.info(f"Waiting {delay_seconds}s before auto-executing INITIAL_PROMPT (allow Service DNS to propagate)...") + await asyncio.sleep(delay_seconds) logger.info("Auto-executing INITIAL_PROMPT via backend POST...") diff --git a/components/runners/state-sync/Dockerfile b/components/runners/state-sync/Dockerfile new file mode 100644 index 000000000..b0214ff6a --- /dev/null +++ b/components/runners/state-sync/Dockerfile @@ -0,0 +1,21 @@ +FROM alpine:3.19 + +# Install rclone, git, and utilities +RUN apk add --no-cache \ + rclone \ + git \ + bash \ + curl \ + jq \ + ca-certificates + +# Copy scripts +COPY hydrate.sh /usr/local/bin/hydrate.sh +COPY sync.sh /usr/local/bin/sync.sh + +# Make scripts executable +RUN chmod +x /usr/local/bin/hydrate.sh /usr/local/bin/sync.sh + +# Default to sync.sh (used by sidecar) +ENTRYPOINT ["/usr/local/bin/sync.sh"] + diff --git a/components/runners/state-sync/hydrate.sh b/components/runners/state-sync/hydrate.sh new file mode 100644 index 000000000..165f198c4 --- /dev/null +++ b/components/runners/state-sync/hydrate.sh @@ -0,0 +1,232 @@ +#!/bin/bash +# hydrate.sh - Init container script to download session state from S3 + +set -e + +# Configuration from environment +S3_ENDPOINT="${S3_ENDPOINT:-http://minio.ambient-code.svc:9000}" +S3_BUCKET="${S3_BUCKET:-ambient-sessions}" +NAMESPACE="${NAMESPACE:-default}" +SESSION_NAME="${SESSION_NAME:-unknown}" + +# Sanitize inputs to prevent path traversal +NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}" +SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}" + +# Paths to sync (must match sync.sh) +SYNC_PATHS=( + ".claude" + "artifacts" + "file-uploads" +) + +# Error handler +error_exit() { + echo "ERROR: $1" >&2 + exit 1 +} + +# Configure rclone for S3 +setup_rclone() { + # Use explicit /tmp path since HOME may not be set in container + mkdir -p /tmp/.config/rclone || error_exit "Failed to create rclone config directory" + cat > /tmp/.config/rclone/rclone.conf << EOF +[s3] +type = s3 +provider = Other 
+access_key_id = ${AWS_ACCESS_KEY_ID} +secret_access_key = ${AWS_SECRET_ACCESS_KEY} +endpoint = ${S3_ENDPOINT} +acl = private +EOF + if [ $? -ne 0 ]; then + error_exit "Failed to write rclone configuration" + fi + # Protect config file with credentials + chmod 600 /tmp/.config/rclone/rclone.conf || error_exit "Failed to secure rclone config" +} + +echo "=========================================" +echo "Ambient Code Session State Hydration" +echo "=========================================" +echo "Session: ${NAMESPACE}/${SESSION_NAME}" +echo "S3 Endpoint: ${S3_ENDPOINT}" +echo "S3 Bucket: ${S3_BUCKET}" +echo "=========================================" + +# Create workspace structure +echo "Creating workspace structure..." +mkdir -p /workspace/.claude || error_exit "Failed to create .claude directory" +mkdir -p /workspace/artifacts || error_exit "Failed to create artifacts directory" +mkdir -p /workspace/file-uploads || error_exit "Failed to create file-uploads directory" +mkdir -p /workspace/repos || error_exit "Failed to create repos directory" + +# Set permissions on created directories (not root workspace which may be owned by different user) +# Use 755 instead of 777 - readable by all, writable only by owner +chmod 755 /workspace/.claude /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true + +# Check if S3 is configured +if [ -z "${S3_ENDPOINT}" ] || [ -z "${S3_BUCKET}" ] || [ -z "${AWS_ACCESS_KEY_ID}" ] || [ -z "${AWS_SECRET_ACCESS_KEY}" ]; then + echo "S3 not configured - using ephemeral storage only (no state persistence)" + echo "=========================================" + exit 0 +fi + +# Setup rclone +echo "Setting up rclone..." +setup_rclone + +S3_PATH="s3:${S3_BUCKET}/${NAMESPACE}/${SESSION_NAME}" + +# Test S3 connection +echo "Testing S3 connection..." +if ! rclone --config /tmp/.config/rclone/rclone.conf lsd "s3:${S3_BUCKET}/" --max-depth 1 2>&1; then + error_exit "Failed to connect to S3 at ${S3_ENDPOINT}. Check endpoint and credentials." +fi +echo "S3 connection successful" + +# Check if session state exists in S3 +echo "Checking for existing session state in S3..." +if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/" 2>/dev/null | grep -q .; then + echo "Found existing session state, downloading from S3..." + + # Download each sync path if it exists + for path in "${SYNC_PATHS[@]}"; do + if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/${path}/" 2>/dev/null | grep -q .; then + echo " Downloading ${path}/..." + rclone --config /tmp/.config/rclone/rclone.conf copy "${S3_PATH}/${path}/" "/workspace/${path}/" \ + --copy-links \ + --transfers 8 \ + --fast-list \ + --progress 2>&1 || echo " Warning: failed to download ${path}" + else + echo " No data for ${path}/" + fi + done + + echo "State hydration complete!" +else + echo "No existing state found, starting fresh session" +fi + +# Set permissions on subdirectories (EmptyDir root may not be chmodable) +echo "Setting permissions on subdirectories..." +chmod -R 755 /workspace/.claude /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true + +# ======================================== +# Clone repositories and workflows +# ======================================== +echo "=========================================" +echo "Setting up repositories and workflows..." 
+echo "=========================================" + +# Disable errexit for git clones (failures are non-fatal for private repos without auth) +set +e + +# Set HOME for git config (alpine doesn't set it by default) +export HOME=/tmp + +# Git identity +GIT_USER_NAME="${GIT_USER_NAME:-Ambient Code Bot}" +GIT_USER_EMAIL="${GIT_USER_EMAIL:-bot@ambient-code.local}" +git config --global user.name "$GIT_USER_NAME" || echo "Warning: failed to set git user.name" +git config --global user.email "$GIT_USER_EMAIL" || echo "Warning: failed to set git user.email" + +# Mark workspace as safe (in case runner needs it) +git config --global --add safe.directory /workspace 2>/dev/null || true + +# Clone repos from REPOS_JSON +if [ -n "$REPOS_JSON" ] && [ "$REPOS_JSON" != "null" ] && [ "$REPOS_JSON" != "" ]; then + echo "Cloning repositories from spec..." + # Parse JSON array and clone each repo + REPO_COUNT=$(echo "$REPOS_JSON" | jq -e 'if type == "array" then length else 0 end' 2>/dev/null || echo "0") + echo "Found $REPO_COUNT repositories to clone" + if [ "$REPO_COUNT" -gt 0 ]; then + i=0 + while [ $i -lt $REPO_COUNT ]; do + REPO_URL=$(echo "$REPOS_JSON" | jq -r ".[$i].url // empty" 2>/dev/null || echo "") + REPO_BRANCH=$(echo "$REPOS_JSON" | jq -r ".[$i].branch // \"main\"" 2>/dev/null || echo "main") + + # Derive repo name from URL + REPO_NAME=$(basename "$REPO_URL" .git 2>/dev/null || echo "") + + if [ -n "$REPO_NAME" ] && [ -n "$REPO_URL" ] && [ "$REPO_URL" != "null" ]; then + REPO_DIR="/workspace/repos/$REPO_NAME" + echo " Cloning $REPO_NAME (branch: $REPO_BRANCH)..." + + # Mark repo directory as safe + git config --global --add safe.directory "$REPO_DIR" 2>/dev/null || true + + # Clone repository (for private repos, runner will handle token injection) + if git clone --branch "$REPO_BRANCH" --single-branch "$REPO_URL" "$REPO_DIR" 2>&1; then + echo " ✓ Cloned $REPO_NAME" + else + echo " ⚠ Failed to clone $REPO_NAME (may require authentication)" + fi + fi + i=$((i + 1)) + done + fi +else + echo "No repositories configured in spec" +fi + +# Clone workflow repository +if [ -n "$ACTIVE_WORKFLOW_GIT_URL" ] && [ "$ACTIVE_WORKFLOW_GIT_URL" != "null" ]; then + WORKFLOW_BRANCH="${ACTIVE_WORKFLOW_BRANCH:-main}" + WORKFLOW_PATH="${ACTIVE_WORKFLOW_PATH:-}" + + echo "Cloning workflow repository..." + echo " URL: $ACTIVE_WORKFLOW_GIT_URL" + echo " Branch: $WORKFLOW_BRANCH" + if [ -n "$WORKFLOW_PATH" ]; then + echo " Subpath: $WORKFLOW_PATH" + fi + + # Derive workflow name from URL + WORKFLOW_NAME=$(basename "$ACTIVE_WORKFLOW_GIT_URL" .git) + WORKFLOW_FINAL="/workspace/workflows/${WORKFLOW_NAME}" + WORKFLOW_TEMP="/tmp/workflow-clone-$$" + + git config --global --add safe.directory "$WORKFLOW_FINAL" 2>/dev/null || true + + # Clone to temp location + if git clone --branch "$WORKFLOW_BRANCH" --single-branch "$ACTIVE_WORKFLOW_GIT_URL" "$WORKFLOW_TEMP" 2>&1; then + echo " Clone successful, processing..." 
+ + # Extract subpath if specified + if [ -n "$WORKFLOW_PATH" ]; then + SUBPATH_FULL="$WORKFLOW_TEMP/$WORKFLOW_PATH" + echo " Checking for subpath: $SUBPATH_FULL" + ls -la "$SUBPATH_FULL" 2>&1 || echo " Subpath does not exist" + + if [ -d "$SUBPATH_FULL" ]; then + echo " Extracting subpath: $WORKFLOW_PATH" + mkdir -p "$(dirname "$WORKFLOW_FINAL")" + cp -r "$SUBPATH_FULL" "$WORKFLOW_FINAL" + rm -rf "$WORKFLOW_TEMP" + echo " ✓ Workflow extracted from subpath to /workspace/workflows/${WORKFLOW_NAME}" + else + echo " ⚠ Subpath '$WORKFLOW_PATH' not found in cloned repo" + echo " Available paths in repo:" + find "$WORKFLOW_TEMP" -maxdepth 3 -type d | head -10 + echo " Using entire repo instead" + mv "$WORKFLOW_TEMP" "$WORKFLOW_FINAL" + echo " ✓ Workflow ready at /workspace/workflows/${WORKFLOW_NAME}" + fi + else + # No subpath - use entire repo + mv "$WORKFLOW_TEMP" "$WORKFLOW_FINAL" + echo " ✓ Workflow ready at /workspace/workflows/${WORKFLOW_NAME}" + fi + else + echo " ⚠ Failed to clone workflow" + rm -rf "$WORKFLOW_TEMP" 2>/dev/null || true + fi +fi + +echo "=========================================" +echo "Workspace initialized successfully" +echo "=========================================" +exit 0 + diff --git a/components/runners/state-sync/sync.sh b/components/runners/state-sync/sync.sh new file mode 100644 index 000000000..05498ac5f --- /dev/null +++ b/components/runners/state-sync/sync.sh @@ -0,0 +1,156 @@ +#!/bin/bash +# sync.sh - Sidecar script to sync session state to S3 every N seconds + +set -e + +# Configuration from environment +S3_ENDPOINT="${S3_ENDPOINT:-http://minio.ambient-code.svc:9000}" +S3_BUCKET="${S3_BUCKET:-ambient-sessions}" +NAMESPACE="${NAMESPACE:-default}" +SESSION_NAME="${SESSION_NAME:-unknown}" +SYNC_INTERVAL="${SYNC_INTERVAL:-60}" +MAX_SYNC_SIZE="${MAX_SYNC_SIZE:-1073741824}" # 1GB default + +# Sanitize inputs to prevent path traversal +NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}" +SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}" + +# Paths to sync (non-git content) +SYNC_PATHS=( + ".claude" + "artifacts" + "file-uploads" +) + +# Patterns to exclude from sync +EXCLUDE_PATTERNS=( + "repos/**" # Git handles this + "node_modules/**" + ".venv/**" + "__pycache__/**" + ".cache/**" + "*.pyc" + "target/**" + "dist/**" + "build/**" + ".git/**" + ".claude/debug/**" # Debug logs with symlinks that break rclone +) + +# Configure rclone for S3 +setup_rclone() { + # Use explicit /tmp path since HOME may not be set in container + mkdir -p /tmp/.config/rclone + cat > /tmp/.config/rclone/rclone.conf << EOF +[s3] +type = s3 +provider = Other +access_key_id = ${AWS_ACCESS_KEY_ID} +secret_access_key = ${AWS_SECRET_ACCESS_KEY} +endpoint = ${S3_ENDPOINT} +acl = private +EOF + # Protect config file with credentials + chmod 600 /tmp/.config/rclone/rclone.conf +} + +# Check total size before sync +check_size() { + local total=0 + for path in "${SYNC_PATHS[@]}"; do + if [ -d "/workspace/${path}" ]; then + size=$(du -sb "/workspace/${path}" 2>/dev/null | cut -f1 || echo 0) + total=$((total + size)) + fi + done + + if [ $total -gt $MAX_SYNC_SIZE ]; then + echo "WARNING: Sync size (${total} bytes) exceeds limit (${MAX_SYNC_SIZE} bytes)" + echo "Some files may be skipped" + return 1 + fi + return 0 +} + +# Sync workspace state to S3 +sync_to_s3() { + local s3_path="s3:${S3_BUCKET}/${NAMESPACE}/${SESSION_NAME}" + + echo "[$(date -Iseconds)] Starting sync to S3..." + + local synced=0 + for path in "${SYNC_PATHS[@]}"; do + if [ -d "/workspace/${path}" ]; then + echo " Syncing ${path}/..." 
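+            # The printf expansion below turns EXCLUDE_PATTERNS into repeated flags,
+            # roughly: --exclude repos/** --exclude node_modules/** ... (illustrative).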
+ if rclone --config /tmp/.config/rclone/rclone.conf sync "/workspace/${path}" "${s3_path}/${path}/" \ + --checksum \ + --copy-links \ + --transfers 4 \ + --fast-list \ + --stats-one-line \ + --max-size ${MAX_SYNC_SIZE} \ + $(printf -- '--exclude %s ' "${EXCLUDE_PATTERNS[@]}") \ + 2>&1; then + synced=$((synced + 1)) + else + echo " Warning: sync of ${path} had errors" + fi + fi + done + + # Save metadata + echo "{\"lastSync\": \"$(date -Iseconds)\", \"session\": \"${SESSION_NAME}\", \"namespace\": \"${NAMESPACE}\", \"pathsSynced\": ${synced}}" > /tmp/metadata.json + rclone --config /tmp/.config/rclone/rclone.conf copy /tmp/metadata.json "${s3_path}/" 2>&1 || true + + echo "[$(date -Iseconds)] Sync complete (${synced} paths synced)" +} + +# Final sync on shutdown +final_sync() { + echo "" + echo "=========================================" + echo "[$(date -Iseconds)] SIGTERM received, performing final sync..." + echo "=========================================" + sync_to_s3 + echo "=========================================" + echo "[$(date -Iseconds)] Final sync complete, exiting" + echo "=========================================" + exit 0 +} + +# Main +echo "=========================================" +echo "Ambient Code State Sync Sidecar" +echo "=========================================" +echo "Session: ${NAMESPACE}/${SESSION_NAME}" +echo "S3 Endpoint: ${S3_ENDPOINT}" +echo "S3 Bucket: ${S3_BUCKET}" +echo "Sync interval: ${SYNC_INTERVAL}s" +echo "Max sync size: ${MAX_SYNC_SIZE} bytes" +echo "=========================================" + +# Check if S3 is configured +if [ -z "${S3_ENDPOINT}" ] || [ -z "${S3_BUCKET}" ] || [ -z "${AWS_ACCESS_KEY_ID}" ] || [ -z "${AWS_SECRET_ACCESS_KEY}" ]; then + echo "S3 not configured - state sync disabled (ephemeral storage only)" + echo "Session will not persist across pod restarts" + echo "=========================================" + # Sleep forever - keep sidecar alive but do nothing + while true; do + sleep 3600 + done +fi + +setup_rclone +trap 'final_sync' SIGTERM SIGINT + +# Initial delay to let workspace populate +echo "Waiting 30s for workspace to populate..." +sleep 30 + +# Main sync loop +while true; do + check_size || echo "Size check warning (continuing anyway)" + sync_to_s3 || echo "Sync failed, will retry in ${SYNC_INTERVAL}s..." + sleep ${SYNC_INTERVAL} +done + diff --git a/docs/minio-quickstart.md b/docs/minio-quickstart.md new file mode 100644 index 000000000..26fbe2a5c --- /dev/null +++ b/docs/minio-quickstart.md @@ -0,0 +1,297 @@ +# MinIO Quickstart for Ambient Code + +## Overview + +MinIO provides in-cluster S3-compatible storage for Ambient Code session state, artifacts, and uploads. This guide shows you how to deploy and configure MinIO. + +## Quick Setup + +### 1. Deploy MinIO + +```bash +# Create MinIO credentials secret +cd components/manifests/base +cp minio-credentials-secret.yaml.example minio-credentials-secret.yaml + +# Edit credentials (change admin/changeme123 to secure values) +vi minio-credentials-secret.yaml + +# Apply the secret +kubectl apply -f minio-credentials-secret.yaml -n ambient-code + +# MinIO deployment is included in base manifests, so deploy normally +make deploy NAMESPACE=ambient-code +``` + +### 2. Create Bucket + +```bash +# Run automated setup +make setup-minio NAMESPACE=ambient-code + +# Or manually: +kubectl port-forward svc/minio 9001:9001 -n ambient-code & +open http://localhost:9001 +# Login with credentials, create bucket "ambient-sessions" +``` + +### 3. 
Configure Project + +Navigate to project settings in the UI and configure: + +| Field | Value | +|-------|-------| +| **Enable S3 Storage** | ✅ Checked | +| **S3_ENDPOINT** | `http://minio.ambient-code.svc:9000` | +| **S3_BUCKET** | `ambient-sessions` | +| **S3_REGION** | `us-east-1` (not used by MinIO but required field) | +| **S3_ACCESS_KEY** | Your MinIO root user | +| **S3_SECRET_KEY** | Your MinIO root password | + +Click **Save Integration Secrets**. + +## Accessing MinIO Console + +### Option 1: Port Forward + +```bash +make minio-console NAMESPACE=ambient-code +# Opens at http://localhost:9001 +``` + +### Option 2: Create Route (OpenShift) + +```bash +oc create route edge minio-console \ + --service=minio \ + --port=9001 \ + -n ambient-code + +# Get URL +oc get route minio-console -n ambient-code -o jsonpath='{.spec.host}' +``` + +## Viewing Session Artifacts + +### Via MinIO Console + +1. Open MinIO console: `make minio-console` +2. Navigate to "Buckets" → "ambient-sessions" +3. Browse: `{namespace}/{session-name}/` + - `.claude/` - Session history + - `artifacts/` - Generated files + - `uploads/` - User uploads + +### Via MinIO Client (mc) + +```bash +# Install mc +brew install minio/stable/mc + +# Configure alias +kubectl port-forward svc/minio 9000:9000 -n ambient-code & +mc alias set ambient http://localhost:9000 admin changeme123 + +# List sessions +mc ls ambient/ambient-sessions/ + +# List session artifacts +mc ls ambient/ambient-sessions/my-project/session-abc/artifacts/ + +# Download artifacts +mc cp --recursive ambient/ambient-sessions/my-project/session-abc/artifacts/ ./local-dir/ + +# Download session history +mc cp --recursive ambient/ambient-sessions/my-project/session-abc/.claude/ ./.claude/ +``` + +### Via kubectl exec + +```bash +# Get MinIO pod +MINIO_POD=$(kubectl get pod -l app=minio -n ambient-code -o jsonpath='{.items[0].metadata.name}') + +# List sessions +kubectl exec -n ambient-code "${MINIO_POD}" -- mc ls local/ambient-sessions/ + +# Download file +kubectl exec -n ambient-code "${MINIO_POD}" -- mc cp "local/ambient-sessions/my-project/session-abc/artifacts/report.pdf" /tmp/ +kubectl cp "ambient-code/${MINIO_POD}:/tmp/report.pdf" ./report.pdf +``` + +## Management Commands + +```bash +# Check MinIO status +make minio-status NAMESPACE=ambient-code + +# View MinIO logs +make minio-logs NAMESPACE=ambient-code + +# Port forward to MinIO API (for mc commands) +kubectl port-forward svc/minio 9000:9000 -n ambient-code +``` + +## Bucket Lifecycle Management + +### Set Auto-Delete Policy + +Keep storage costs down by auto-deleting old sessions: + +```bash +# Create lifecycle policy +cat > /tmp/lifecycle.json << 'EOF' +{ + "Rules": [ + { + "ID": "expire-old-sessions", + "Status": "Enabled", + "Expiration": { + "Days": 30 + } + } + ] +} +EOF + +# Apply policy +kubectl exec -n ambient-code "${MINIO_POD}" -- mc ilm import "local/ambient-sessions" /tmp/lifecycle.json +``` + +### Monitor Storage Usage + +```bash +# Check bucket size +kubectl exec -n ambient-code "${MINIO_POD}" -- mc du local/ambient-sessions + +# List largest sessions +kubectl exec -n ambient-code "${MINIO_POD}" -- mc du --depth 2 local/ambient-sessions | sort -n -r | head -10 +``` + +## Backup and Restore + +### Backup MinIO Data + +```bash +# Backup to local directory +kubectl exec -n ambient-code "${MINIO_POD}" -- mc mirror local/ambient-sessions /tmp/backup/ +kubectl cp "ambient-code/${MINIO_POD}:/tmp/backup" ./minio-backup/ + +# Or use external mc client +mc mirror ambient/ambient-sessions 
./minio-backup/ +``` + +### Restore from Backup + +```bash +# Copy backup to pod +kubectl cp ./minio-backup/ "ambient-code/${MINIO_POD}:/tmp/restore" + +# Restore +kubectl exec -n ambient-code "${MINIO_POD}" -- mc mirror /tmp/restore local/ambient-sessions +``` + +## Troubleshooting + +### MinIO Pod Not Starting + +```bash +# Check events +kubectl get events -n ambient-code --sort-by='.lastTimestamp' | grep minio + +# Check PVC +kubectl get pvc minio-data -n ambient-code + +# Check pod logs +kubectl logs -f deployment/minio -n ambient-code +``` + +### Can't Access MinIO Console + +```bash +# Check service +kubectl get svc minio -n ambient-code + +# Test connection from within cluster +kubectl run -it --rm debug --image=curlimages/curl --restart=Never -n ambient-code -- \ + curl -v http://minio.ambient-code.svc:9000/minio/health/live +``` + +### Session Init Failing + +```bash +# Check session pod init container logs +kubectl logs {session-pod} -c init-hydrate -n {namespace} + +# Common issues: +# - Wrong S3 endpoint (check project settings) +# - Bucket doesn't exist (create in MinIO console) +# - Wrong credentials (verify in project settings) +``` + +## Production Considerations + +### High Availability + +For production, deploy MinIO in distributed mode: + +```bash +# Use MinIO Operator +kubectl apply -k "github.com/minio/operator" +kubectl apply -f - </dev/null 2>&1; then + MINIO_USER=$(kubectl get secret minio-credentials -n "${NAMESPACE}" -o jsonpath='{.data.root-user}' | base64 -d) + MINIO_PASSWORD=$(kubectl get secret minio-credentials -n "${NAMESPACE}" -o jsonpath='{.data.root-password}' | base64 -d) +else + echo "ERROR: minio-credentials secret not found in namespace ${NAMESPACE}" + echo "Please create it first:" + echo " 1. Copy components/manifests/base/minio-credentials-secret.yaml.example to minio-credentials-secret.yaml" + echo " 2. Edit with secure credentials" + echo " 3. kubectl apply -f minio-credentials-secret.yaml -n ${NAMESPACE}" + exit 1 +fi + +echo "=========================================" +echo "MinIO Setup for Ambient Code Platform" +echo "=========================================" +echo "Namespace: ${NAMESPACE}" +echo "Bucket: ${BUCKET_NAME}" +echo "=========================================" + +# Check if MinIO is deployed +echo "Checking MinIO deployment..." +if ! kubectl get deployment minio -n "${NAMESPACE}" >/dev/null 2>&1; then + echo "Error: MinIO deployment not found in namespace ${NAMESPACE}" + echo "Deploy MinIO first: kubectl apply -f components/manifests/base/minio-deployment.yaml" + exit 1 +fi + +# Wait for MinIO to be ready +echo "Waiting for MinIO to be ready..." +kubectl wait --for=condition=ready pod -l app=minio -n "${NAMESPACE}" --timeout=120s + +# Get MinIO pod name +MINIO_POD=$(kubectl get pod -l app=minio -n "${NAMESPACE}" -o jsonpath='{.items[0].metadata.name}') +echo "MinIO pod: ${MINIO_POD}" + +# Set up MinIO client alias +echo "Configuring MinIO client..." +kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc alias set local http://localhost:9000 "${MINIO_USER}" "${MINIO_PASSWORD}" + +# Create bucket if it doesn't exist +echo "Creating bucket: ${BUCKET_NAME}..." +if kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc ls "local/${BUCKET_NAME}" >/dev/null 2>&1; then + echo "Bucket ${BUCKET_NAME} already exists" +else + kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc mb "local/${BUCKET_NAME}" + echo "Created bucket: ${BUCKET_NAME}" +fi + +# Set bucket to private (default) +echo "Setting bucket policy..." 
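+# "mc anonymous set none" removes any anonymous-access policy, so the bucket is only
+# reachable with the credentials above (new MinIO buckets are private by default).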
+kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc anonymous set none "local/${BUCKET_NAME}" + +# Enable versioning (optional - helps with recovery) +echo "Enabling versioning..." +kubectl exec -n "${NAMESPACE}" "${MINIO_POD}" -- mc version enable "local/${BUCKET_NAME}" + +# Show bucket info +echo "" +echo "=========================================" +echo "MinIO Setup Complete!" +echo "=========================================" +echo "Bucket: ${BUCKET_NAME}" +echo "Endpoint: http://minio.${NAMESPACE}.svc:9000" +echo "" +echo "MinIO Console Access:" +echo " kubectl port-forward svc/minio 9001:9001 -n ${NAMESPACE}" +echo " Then open: http://localhost:9001" +echo " Login: ${MINIO_USER} / ${MINIO_PASSWORD}" +echo "" +echo "Configure in Project Settings:" +echo " S3_ENDPOINT: http://minio.${NAMESPACE}.svc:9000" +echo " S3_BUCKET: ${BUCKET_NAME}" +echo " S3_ACCESS_KEY: ${MINIO_USER}" +echo " S3_SECRET_KEY: ${MINIO_PASSWORD}" +echo "=========================================" + From 7be7e66b34ef6b604aa0b28ef47804881e17212a Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 17:21:35 -0600 Subject: [PATCH 2/6] feat: Enhance repository management and session handling - Implemented runtime cloning of repositories when added to a session, improving user experience by allowing immediate access to code. - Updated session handling to derive repository names from URLs, ensuring consistency in naming conventions. - Added user authentication and authorization validation for session-related API endpoints, enhancing security. - Improved frontend session detail page to conditionally display options and menus based on session status, streamlining user interaction. - Refactored backend code to remove legacy watch-based implementations, transitioning to a more efficient controller-runtime based approach for session management. 
--- components/backend/handlers/sessions.go | 73 ++++- .../[name]/sessions/[sessionName]/page.tsx | 48 +-- .../operator/internal/handlers/reconciler.go | 55 ---- .../operator/internal/handlers/sessions.go | 70 +--- components/operator/main.go | 31 +- .../runners/claude-code-runner/adapter.py | 22 +- components/runners/claude-code-runner/main.py | 299 +++++++++++++++++- components/runners/state-sync/hydrate.sh | 24 +- components/runners/state-sync/sync.sh | 31 +- 9 files changed, 459 insertions(+), 194 deletions(-) diff --git a/components/backend/handlers/sessions.go b/components/backend/handlers/sessions.go index 591213de8..6af2681a8 100644 --- a/components/backend/handlers/sessions.go +++ b/components/backend/handlers/sessions.go @@ -2,6 +2,7 @@ package handlers import ( + "bytes" "context" "encoding/base64" "encoding/json" @@ -1276,6 +1277,52 @@ func AddRepo(c *gin.Context) { return } + // Derive repo name from URL + repoName := req.URL + if idx := strings.LastIndex(req.URL, "/"); idx != -1 { + repoName = req.URL[idx+1:] + } + repoName = strings.TrimSuffix(repoName, ".git") + + // Call runner to clone the repository (if session is running) + status, _ := item.Object["status"].(map[string]interface{}) + phase, _ := status["phase"].(string) + if phase == "Running" { + runnerURL := fmt.Sprintf("http://session-%s.%s.svc.cluster.local:8001/repos/add", sessionName, project) + runnerReq := map[string]string{ + "url": req.URL, + "branch": req.Branch, + "name": repoName, + } + reqBody, _ := json.Marshal(runnerReq) + + log.Printf("Calling runner to clone repo: %s -> %s", req.URL, runnerURL) + httpReq, err := http.NewRequestWithContext(c.Request.Context(), "POST", runnerURL, bytes.NewReader(reqBody)) + if err != nil { + log.Printf("Failed to create runner request: %v", err) + c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to create runner request"}) + return + } + httpReq.Header.Set("Content-Type", "application/json") + + client := &http.Client{Timeout: 120 * time.Second} // Allow time for clone + resp, err := client.Do(httpReq) + if err != nil { + log.Printf("Failed to call runner to clone repo: %v", err) + c.JSON(http.StatusInternalServerError, gin.H{"error": "Failed to clone repository (runner not reachable)"}) + return + } + defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + body, _ := io.ReadAll(resp.Body) + log.Printf("Runner failed to clone repo (status %d): %s", resp.StatusCode, string(body)) + c.JSON(resp.StatusCode, gin.H{"error": fmt.Sprintf("Failed to clone repository: %s", string(body))}) + return + } + log.Printf("Runner successfully cloned repo %s for session %s", repoName, sessionName) + } + // Update spec.repos spec, ok := item.Object["spec"].(map[string]interface{}) if !ok { @@ -1315,7 +1362,7 @@ func AddRepo(c *gin.Context) { } log.Printf("Added repository %s to session %s in project %s", req.URL, sessionName, project) - c.JSON(http.StatusOK, gin.H{"message": "Repository added", "session": session}) + c.JSON(http.StatusOK, gin.H{"message": "Repository added", "name": repoName, "session": session}) } // RemoveRepo removes a repository from a running session @@ -1420,6 +1467,14 @@ func GetWorkflowMetadata(c *gin.Context) { return } + // Validate user authentication and authorization + reqK8s, _ := GetK8sClientsForRequest(c) + if reqK8s == nil { + c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) + c.Abort() + return + } + // Get authorization token token := c.GetHeader("Authorization") if strings.TrimSpace(token) == "" { @@ 
-2209,6 +2264,14 @@ func ListSessionWorkspace(c *gin.Context) { return } + // Validate user authentication and authorization + reqK8s, _ := GetK8sClientsForRequest(c) + if reqK8s == nil { + c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) + c.Abort() + return + } + rel := strings.TrimSpace(c.Query("path")) // Path is relative to content service's StateBaseDir (which is /workspace) // Content service handles the base path, so we just pass the relative path @@ -2285,6 +2348,14 @@ func GetSessionWorkspaceFile(c *gin.Context) { return } + // Validate user authentication and authorization + reqK8s, _ := GetK8sClientsForRequest(c) + if reqK8s == nil { + c.JSON(http.StatusUnauthorized, gin.H{"error": "Invalid or missing token"}) + c.Abort() + return + } + sub := strings.TrimPrefix(c.Param("path"), "/") // Path is relative to content service's StateBaseDir (which is /workspace) absPath := sub diff --git a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx index da3c28e76..987230788 100644 --- a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx +++ b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx @@ -359,13 +359,15 @@ export default function ProjectSessionDetailPage({ if (data.name && data.inputRepo) { try { + // Repos are cloned to /workspace/repos/{name} + const repoPath = `repos/${data.name}`; await fetch( `/api/projects/${projectName}/agentic-sessions/${sessionName}/git/configure-remote`, { method: "POST", headers: { "Content-Type": "application/json" }, body: JSON.stringify({ - path: data.name, + path: repoPath, remoteUrl: data.inputRepo.url, branch: data.inputRepo.branch || "main", }), @@ -373,7 +375,7 @@ export default function ProjectSessionDetailPage({ ); const newRemotes = { ...directoryRemotes }; - newRemotes[data.name] = { + newRemotes[repoPath] = { url: data.inputRepo.url, branch: data.inputRepo.branch || "main", }; @@ -1382,36 +1384,39 @@ export default function ProjectSessionDetailPage({
- {/* Mobile: Options menu button (below header border) */} -
- -
+ {/* Mobile: Options menu button (below header border) - only show when session is running */} + {session?.status?.phase === "Running" && ( +
+ +
+ )} {/* Main content area */}
- {/* Mobile sidebar overlay */} - {mobileMenuOpen && ( + {/* Mobile sidebar overlay - only show when session is running */} + {session?.status?.phase === "Running" && mobileMenuOpen && (
setMobileMenuOpen(false)} /> )} - {/* Left Column - Accordions */} -
+ {/* Left Column - Accordions - only show when session is running */} + {session?.status?.phase === "Running" && ( +
{/* Mobile close button */}
+ )} {/* Right Column - Messages */}
diff --git a/components/operator/internal/handlers/reconciler.go b/components/operator/internal/handlers/reconciler.go index de10f5b42..f7982932e 100644 --- a/components/operator/internal/handlers/reconciler.go +++ b/components/operator/internal/handlers/reconciler.go @@ -393,58 +393,3 @@ func collectPodErrorMessage(pod *corev1.Pod) string { return errorMsg } - -// WatchAgenticSessionsLegacy is the original watch-based implementation. -// This is kept for backward compatibility during migration. -// DEPRECATED: Use controller-runtime based reconciliation instead. -func WatchAgenticSessionsLegacy() { - gvr := types.GetAgenticSessionResource() - - for { - // Watch AgenticSessions across all namespaces - watcher, err := config.DynamicClient.Resource(gvr).Watch(context.TODO(), v1.ListOptions{}) - if err != nil { - log.Printf("Failed to create AgenticSession watcher: %v", err) - time.Sleep(5 * time.Second) - continue - } - - log.Println("Watching for AgenticSession events across all namespaces...") - - for event := range watcher.ResultChan() { - // Reduced logging - only log errors and key events - switch event.Type { - case "ADDED", "MODIFIED": - obj := event.Object.(*unstructured.Unstructured) - - // Only process resources in managed namespaces - ns := obj.GetNamespace() - if ns == "" { - continue - } - nsObj, err := config.K8sClient.CoreV1().Namespaces().Get(context.TODO(), ns, v1.GetOptions{}) - if err != nil { - continue - } - if nsObj.Labels["ambient-code.io/managed"] != "true" { - continue - } - - // Remove the 100ms delay - controller-runtime handles debouncing - if err := handleAgenticSessionEvent(obj); err != nil { - log.Printf("Error handling AgenticSession event: %v", err) - } - case "DELETED": - obj := event.Object.(*unstructured.Unstructured) - log.Printf("AgenticSession %s/%s deleted", obj.GetNamespace(), obj.GetName()) - case "ERROR": - obj := event.Object.(*unstructured.Unstructured) - log.Printf("Watch error for AgenticSession: %v", obj) - } - } - - log.Println("AgenticSession watch channel closed, restarting...") - watcher.Stop() - time.Sleep(2 * time.Second) - } -} diff --git a/components/operator/internal/handlers/sessions.go b/components/operator/internal/handlers/sessions.go index a7d412a2d..867a09e68 100644 --- a/components/operator/internal/handlers/sessions.go +++ b/components/operator/internal/handlers/sessions.go @@ -30,73 +30,15 @@ import ( ) // Track which pods are currently being monitored to prevent duplicate goroutines +// NOTE: This is used by the legacy handleAgenticSessionEvent function which is +// kept for reference but no longer actively called by the operator. +// The controller-runtime based reconciler in internal/controller/ handles all +// AgenticSession reconciliation now. 
var ( monitoredPods = make(map[string]bool) monitoredPodsMu sync.Mutex ) -// WatchAgenticSessions watches for AgenticSession custom resources and creates pods -func WatchAgenticSessions() { - gvr := types.GetAgenticSessionResource() - - for { - // Watch AgenticSessions across all namespaces - watcher, err := config.DynamicClient.Resource(gvr).Watch(context.TODO(), v1.ListOptions{}) - if err != nil { - log.Printf("Failed to create AgenticSession watcher: %v", err) - time.Sleep(5 * time.Second) - continue - } - - log.Println("Watching for AgenticSession events across all namespaces...") - - for event := range watcher.ResultChan() { - switch event.Type { - case watch.Added, watch.Modified: - obj := event.Object.(*unstructured.Unstructured) - - // Only process resources in managed namespaces - ns := obj.GetNamespace() - if ns == "" { - continue - } - nsObj, err := config.K8sClient.CoreV1().Namespaces().Get(context.TODO(), ns, v1.GetOptions{}) - if err != nil { - log.Printf("Failed to get namespace %s: %v", ns, err) - continue - } - if nsObj.Labels["ambient-code.io/managed"] != "true" { - // Skip unmanaged namespaces - continue - } - - // Add small delay to avoid race conditions with rapid create/delete cycles - time.Sleep(100 * time.Millisecond) - - if err := handleAgenticSessionEvent(obj); err != nil { - log.Printf("Error handling AgenticSession event: %v", err) - } - case watch.Deleted: - obj := event.Object.(*unstructured.Unstructured) - sessionName := obj.GetName() - sessionNamespace := obj.GetNamespace() - log.Printf("AgenticSession %s/%s deleted", sessionNamespace, sessionName) - - // Cancel any ongoing job monitoring for this session - // (We could implement this with a context cancellation if needed) - // OwnerReferences handle cleanup of per-session resources - case watch.Error: - obj := event.Object.(*unstructured.Unstructured) - log.Printf("Watch error for AgenticSession: %v", obj) - } - } - - log.Println("AgenticSession watch channel closed, restarting...") - watcher.Stop() - time.Sleep(2 * time.Second) - } -} - func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { name := obj.GetName() sessionNamespace := obj.GetNamespace() @@ -917,6 +859,8 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }(), VolumeMounts: []corev1.VolumeMount{ {Name: "workspace", MountPath: "/workspace"}, + // SubPath mount for .claude so init container writes to same location as runner + {Name: "workspace", MountPath: "/app/.claude", SubPath: ".claude"}, }, }, }, @@ -1235,6 +1179,8 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }, VolumeMounts: []corev1.VolumeMount{ {Name: "workspace", MountPath: "/workspace", ReadOnly: false}, + // SubPath mount for .claude so sync sidecar reads from same location as runner + {Name: "workspace", MountPath: "/app/.claude", SubPath: ".claude", ReadOnly: false}, }, Resources: corev1.ResourceRequirements{ Requests: corev1.ResourceList{ diff --git a/components/operator/main.go b/components/operator/main.go index c71c12709..3eb47a231 100644 --- a/components/operator/main.go +++ b/components/operator/main.go @@ -44,7 +44,6 @@ func main() { var enableLeaderElection bool var probeAddr string var maxConcurrentReconciles int - var useLegacyWatch bool flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "The address the metric endpoint binds to.") flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "The address the probe endpoint binds to.") @@ -53,8 +52,6 @@ func main() { "Enabling this will ensure 
there is only one active controller manager.") flag.IntVar(&maxConcurrentReconciles, "max-concurrent-reconciles", 10, "Maximum number of concurrent Reconciles which can be run. Higher values allow more throughput but consume more resources.") - flag.BoolVar(&useLegacyWatch, "legacy-watch", false, - "Use legacy watch-based implementation instead of controller-runtime (for debugging only).") flag.Parse() // Allow environment variable override for max concurrent reconciles @@ -77,10 +74,9 @@ func main() { logger.Info("Starting Agentic Session Operator", "maxConcurrentReconciles", maxConcurrentReconciles, "leaderElection", enableLeaderElection, - "legacyWatch", useLegacyWatch, ) - // Initialize Kubernetes clients (needed for legacy handlers and config) + // Initialize Kubernetes clients (needed for namespace/projectsettings handlers and config) if err := config.InitK8sClients(); err != nil { logger.Error(err, "Failed to initialize Kubernetes clients") os.Exit(1) @@ -111,13 +107,6 @@ func main() { } } - // If legacy watch mode is requested, use the old implementation - if useLegacyWatch { - logger.Info("Using legacy watch-based implementation") - runLegacyMode() - return - } - // Create controller-runtime manager with increased QPS/Burst to avoid client-side throttling // Default is QPS=5, Burst=10 which causes delays when handling multiple sessions restConfig := ctrl.GetConfigOrDie() @@ -175,24 +164,6 @@ func main() { } } -// runLegacyMode runs the operator using the old watch-based implementation. -// This is kept for backward compatibility and debugging. -func runLegacyMode() { - log.Println("=== LEGACY MODE: Using watch-based implementation ===") - - // Start watching AgenticSession resources (legacy) - go handlers.WatchAgenticSessions() - - // Start watching for managed namespaces - go handlers.WatchNamespaces() - - // Start watching ProjectSettings resources - go handlers.WatchProjectSettings() - - // Keep the operator running - select {} -} - func logBuildInfo() { log.Println("==============================================") log.Println("Agentic Session Operator - Build Information") diff --git a/components/runners/claude-code-runner/adapter.py b/components/runners/claude-code-runner/adapter.py index ad8002f23..4b4e8c30f 100644 --- a/components/runners/claude-code-runner/adapter.py +++ b/components/runners/claude-code-runner/adapter.py @@ -790,11 +790,12 @@ def _setup_workflow_paths(self, active_workflow_url: str, repos_cfg: list) -> tu logger.warning(f"Failed to derive workflow name: {e}, using default") cwd_path = str(Path(self.context.workspace_path) / "workflows" / "default") - # Add all repos as additional directories + # Add all repos as additional directories (repos are in /workspace/repos/{name}) + repos_base = Path(self.context.workspace_path) / "repos" for r in repos_cfg: name = (r.get('name') or '').strip() if name: - repo_path = str(Path(self.context.workspace_path) / name) + repo_path = str(repos_base / name) if repo_path not in add_dirs: add_dirs.append(repo_path) @@ -810,8 +811,14 @@ def _setup_workflow_paths(self, active_workflow_url: str, repos_cfg: list) -> tu return cwd_path, add_dirs, derived_name def _setup_multi_repo_paths(self, repos_cfg: list) -> tuple[str, list]: - """Setup paths for multi-repo mode.""" + """Setup paths for multi-repo mode. 
+ + Repos are cloned to /workspace/repos/{name} by both: + - hydrate.sh (init container) + - clone_repo_at_runtime() (runtime addition) + """ add_dirs = [] + repos_base = Path(self.context.workspace_path) / "repos" main_name = (os.getenv('MAIN_REPO_NAME') or '').strip() if not main_name: @@ -824,13 +831,15 @@ def _setup_multi_repo_paths(self, repos_cfg: list) -> tuple[str, list]: idx_val = 0 main_name = (repos_cfg[idx_val].get('name') or '').strip() - cwd_path = str(Path(self.context.workspace_path) / main_name) if main_name else self.context.workspace_path + # Main repo path is /workspace/repos/{name} + cwd_path = str(repos_base / main_name) if main_name else self.context.workspace_path for r in repos_cfg: name = (r.get('name') or '').strip() if not name: continue - p = str(Path(self.context.workspace_path) / name) + # All repos are in /workspace/repos/{name} + p = str(repos_base / name) if p != cwd_path: add_dirs.append(p) @@ -1273,9 +1282,10 @@ def _build_workspace_context_prompt(self, repos_cfg, workflow_name, artifacts_pa if repos_cfg: prompt += "## Available Code Repositories\n" + prompt += "Location: repos/\n" for i, repo in enumerate(repos_cfg): name = repo.get('name', f'repo-{i}') - prompt += f"- {name}/\n" + prompt += f"- repos/{name}/\n" prompt += "\nThese repositories contain source code you can read or modify.\n\n" if ambient_config.get("systemPrompt"): diff --git a/components/runners/claude-code-runner/main.py b/components/runners/claude-code-runner/main.py index afbbefaed..412ce70bd 100644 --- a/components/runners/claude-code-runner/main.py +++ b/components/runners/claude-code-runner/main.py @@ -227,12 +227,10 @@ async def event_generator(): try: logger.info("Event generator started") - # Initialize adapter on first run (yields setup events) + # Initialize adapter on first run if not _adapter_initialized: logger.info("First run - initializing adapter with workspace preparation") - async for event in adapter.initialize(context): - logger.debug(f"Yielding initialization event: {event.type}") - yield encoder.encode(event) + await adapter.initialize(context) logger.info("Adapter initialization complete") _adapter_initialized = True @@ -288,6 +286,105 @@ async def interrupt_run(): raise HTTPException(status_code=500, detail=str(e)) +async def clone_workflow_at_runtime(git_url: str, branch: str, subpath: str) -> tuple[bool, str]: + """ + Clone a workflow repository at runtime. + + This mirrors the logic in hydrate.sh but runs when workflows are changed + after the pod has started. 
+ + Returns: + (success, workflow_dir_path) tuple + """ + import tempfile + import shutil + from pathlib import Path + + if not git_url: + return False, "" + + # Derive workflow name from URL + workflow_name = git_url.split("/")[-1].removesuffix(".git") + workspace_path = os.getenv("WORKSPACE_PATH", "/workspace") + workflow_final = Path(workspace_path) / "workflows" / workflow_name + + logger.info(f"Cloning workflow '{workflow_name}' from {git_url}@{branch}") + if subpath: + logger.info(f" Subpath: {subpath}") + + # Create temp directory for clone + temp_dir = Path(tempfile.mkdtemp(prefix="workflow-clone-")) + + try: + # Build git clone command with optional auth token + github_token = os.getenv("GITHUB_TOKEN", "").strip() + gitlab_token = os.getenv("GITLAB_TOKEN", "").strip() + + # Determine which token to use based on URL + clone_url = git_url + if github_token and "github" in git_url.lower(): + clone_url = git_url.replace("https://", f"https://x-access-token:{github_token}@") + logger.info("Using GITHUB_TOKEN for workflow authentication") + elif gitlab_token and "gitlab" in git_url.lower(): + clone_url = git_url.replace("https://", f"https://oauth2:{gitlab_token}@") + logger.info("Using GITLAB_TOKEN for workflow authentication") + + # Clone the repository + process = await asyncio.create_subprocess_exec( + "git", "clone", "--branch", branch, "--single-branch", "--depth", "1", + clone_url, str(temp_dir), + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE + ) + stdout, stderr = await process.communicate() + + if process.returncode != 0: + # Redact tokens from error message + error_msg = stderr.decode() + if github_token: + error_msg = error_msg.replace(github_token, "***REDACTED***") + if gitlab_token: + error_msg = error_msg.replace(gitlab_token, "***REDACTED***") + logger.error(f"Failed to clone workflow: {error_msg}") + return False, "" + + logger.info("Clone successful, processing...") + + # Handle subpath extraction + if subpath: + subpath_full = temp_dir / subpath + if subpath_full.exists() and subpath_full.is_dir(): + logger.info(f"Extracting subpath: {subpath}") + # Remove existing workflow dir if exists + if workflow_final.exists(): + shutil.rmtree(workflow_final) + # Create parent dirs and copy subpath + workflow_final.parent.mkdir(parents=True, exist_ok=True) + shutil.copytree(subpath_full, workflow_final) + logger.info(f"Workflow extracted to {workflow_final}") + else: + logger.warning(f"Subpath '{subpath}' not found, using entire repo") + if workflow_final.exists(): + shutil.rmtree(workflow_final) + shutil.move(str(temp_dir), str(workflow_final)) + else: + # No subpath - use entire repo + if workflow_final.exists(): + shutil.rmtree(workflow_final) + shutil.move(str(temp_dir), str(workflow_final)) + + logger.info(f"Workflow '{workflow_name}' ready at {workflow_final}") + return True, str(workflow_final) + + except Exception as e: + logger.error(f"Error cloning workflow: {e}") + return False, "" + finally: + # Cleanup temp directory if it still exists + if temp_dir.exists(): + shutil.rmtree(temp_dir, ignore_errors=True) + + @app.post("/workflow") async def change_workflow(request: Request): """ @@ -307,6 +404,13 @@ async def change_workflow(request: Request): logger.info(f"Workflow change request: {git_url}@{branch} (path: {path})") + # Clone the workflow repository at runtime + # This is needed because the init container only runs once at pod startup + if git_url: + success, workflow_path = await clone_workflow_at_runtime(git_url, branch, path) + if not 
success: + logger.warning("Failed to clone workflow, will use default workflow directory") + # Update environment variables os.environ["ACTIVE_WORKFLOW_GIT_URL"] = git_url os.environ["ACTIVE_WORKFLOW_BRANCH"] = branch @@ -320,12 +424,106 @@ async def change_workflow(request: Request): # Trigger a new run to greet user with workflow context # This runs in background via backend POST - import asyncio asyncio.create_task(trigger_workflow_greeting(git_url, branch, path)) return {"message": "Workflow updated", "gitUrl": git_url, "branch": branch, "path": path} +async def clone_repo_at_runtime(git_url: str, branch: str, name: str) -> tuple[bool, str]: + """ + Clone a repository at runtime. + + This mirrors the logic in hydrate.sh but runs when repos are added + after the pod has started. + + Args: + git_url: Git repository URL + branch: Branch to clone + name: Name for the cloned directory (derived from URL if empty) + + Returns: + (success, repo_dir_path) tuple + """ + import tempfile + import shutil + from pathlib import Path + + if not git_url: + return False, "" + + # Derive repo name from URL if not provided + if not name: + name = git_url.split("/")[-1].removesuffix(".git") + + # Repos are stored in /workspace/repos/{name} (matching hydrate.sh) + workspace_path = os.getenv("WORKSPACE_PATH", "/workspace") + repos_dir = Path(workspace_path) / "repos" + repos_dir.mkdir(parents=True, exist_ok=True) + repo_final = repos_dir / name + + logger.info(f"Cloning repo '{name}' from {git_url}@{branch}") + + # Skip if already cloned + if repo_final.exists(): + logger.info(f"Repo '{name}' already exists at {repo_final}, skipping clone") + return True, str(repo_final) + + # Create temp directory for clone + temp_dir = Path(tempfile.mkdtemp(prefix="repo-clone-")) + + try: + # Build git clone command with optional auth token + github_token = os.getenv("GITHUB_TOKEN", "").strip() + gitlab_token = os.getenv("GITLAB_TOKEN", "").strip() + + # Determine which token to use based on URL + clone_url = git_url + if github_token and "github" in git_url.lower(): + # Add GitHub token to URL + clone_url = git_url.replace("https://", f"https://x-access-token:{github_token}@") + logger.info("Using GITHUB_TOKEN for authentication") + elif gitlab_token and "gitlab" in git_url.lower(): + # Add GitLab token to URL + clone_url = git_url.replace("https://", f"https://oauth2:{gitlab_token}@") + logger.info("Using GITLAB_TOKEN for authentication") + + # Clone the repository + process = await asyncio.create_subprocess_exec( + "git", "clone", "--branch", branch, "--single-branch", "--depth", "1", + clone_url, str(temp_dir), + stdout=asyncio.subprocess.PIPE, + stderr=asyncio.subprocess.PIPE + ) + stdout, stderr = await process.communicate() + + if process.returncode != 0: + # Redact tokens from error message + error_msg = stderr.decode() + if github_token: + error_msg = error_msg.replace(github_token, "***REDACTED***") + if gitlab_token: + error_msg = error_msg.replace(gitlab_token, "***REDACTED***") + logger.error(f"Failed to clone repo: {error_msg}") + return False, "" + + logger.info("Clone successful, moving to final location...") + + # Move to final location + repo_final.parent.mkdir(parents=True, exist_ok=True) + shutil.move(str(temp_dir), str(repo_final)) + + logger.info(f"Repo '{name}' ready at {repo_final}") + return True, str(repo_final) + + except Exception as e: + logger.error(f"Error cloning repo: {e}") + return False, "" + finally: + # Cleanup temp directory if it still exists + if temp_dir.exists(): + 
shutil.rmtree(temp_dir, ignore_errors=True) + + async def trigger_workflow_greeting(git_url: str, branch: str, path: str): """Trigger workflow greeting after workflow change.""" import uuid @@ -390,7 +588,7 @@ async def trigger_workflow_greeting(git_url: str, branch: str, path: str): @app.post("/repos/add") async def add_repo(request: Request): """ - Add repository - triggers Claude SDK client restart. + Add repository - clones repo and triggers Claude SDK client restart. Accepts: {"url": "...", "branch": "...", "name": "..."} """ @@ -400,7 +598,23 @@ async def add_repo(request: Request): raise HTTPException(status_code=503, detail="Adapter not initialized") body = await request.json() - logger.info(f"Add repo request: {body}") + url = body.get("url", "") + branch = body.get("branch", "main") + name = body.get("name", "") + + logger.info(f"Add repo request: url={url}, branch={branch}, name={name}") + + if not url: + raise HTTPException(status_code=400, detail="Repository URL is required") + + # Derive name from URL if not provided + if not name: + name = url.split("/")[-1].removesuffix(".git") + + # Clone the repository at runtime + success, repo_path = await clone_repo_at_runtime(url, branch, name) + if not success: + raise HTTPException(status_code=500, detail=f"Failed to clone repository: {url}") # Update REPOS_JSON env var repos_json = os.getenv("REPOS_JSON", "[]") @@ -411,22 +625,81 @@ async def add_repo(request: Request): # Add new repo repos.append({ - "name": body.get("name", ""), + "name": name, "input": { - "url": body.get("url", ""), - "branch": body.get("branch", "main") + "url": url, + "branch": branch } }) os.environ["REPOS_JSON"] = json.dumps(repos) - # Reset adapter state + # Reset adapter state to force reinitialization on next run _adapter_initialized = False adapter._first_run = True - logger.info(f"Repo added, adapter will reinitialize on next run") + logger.info(f"Repo '{name}' added and cloned, adapter will reinitialize on next run") + + # Trigger a notification to Claude about the new repository + asyncio.create_task(trigger_repo_added_notification(name, url)) + + return {"message": "Repository added", "name": name, "path": repo_path} + + +async def trigger_repo_added_notification(repo_name: str, repo_url: str): + """Notify Claude that a repository has been added.""" + import uuid + import aiohttp + + # Wait a moment for repo to be fully ready + await asyncio.sleep(1) + + logger.info(f"Triggering repo added notification for: {repo_name}") + + try: + backend_url = os.getenv("BACKEND_API_URL", "").rstrip("/") + project_name = os.getenv("AGENTIC_SESSION_NAMESPACE", "").strip() + session_id = context.session_id if context else "unknown" + + if not backend_url or not project_name: + logger.error("Cannot trigger repo notification: BACKEND_API_URL or PROJECT_NAME not set") + return + + url = f"{backend_url}/projects/{project_name}/agentic-sessions/{session_id}/agui/run" + + notification = f"The repository '{repo_name}' has been added to your workspace. You can now access it at the path 'repos/{repo_name}/'. Please acknowledge this to the user and let them know you can now read and work with files in this repository." 
+ + payload = { + "threadId": session_id, + "runId": str(uuid.uuid4()), + "messages": [{ + "id": str(uuid.uuid4()), + "role": "user", + "content": notification, + "metadata": { + "hidden": True, + "autoSent": True, + "source": "repo_added" + } + }] + } + + bot_token = os.getenv("BOT_TOKEN", "").strip() + headers = {"Content-Type": "application/json"} + if bot_token: + headers["Authorization"] = f"Bearer {bot_token}" + + async with aiohttp.ClientSession() as session: + async with session.post(url, json=payload, headers=headers) as resp: + if resp.status == 200: + result = await resp.json() + logger.info(f"Repo notification sent: {result}") + else: + error_text = await resp.text() + logger.error(f"Repo notification failed: {resp.status} - {error_text}") - return {"message": "Repository added"} + except Exception as e: + logger.error(f"Failed to trigger repo notification: {e}") @app.post("/repos/remove") diff --git a/components/runners/state-sync/hydrate.sh b/components/runners/state-sync/hydrate.sh index 165f198c4..4c33d2ada 100644 --- a/components/runners/state-sync/hydrate.sh +++ b/components/runners/state-sync/hydrate.sh @@ -14,11 +14,12 @@ NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}" SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}" # Paths to sync (must match sync.sh) +# Note: .claude uses /app/.claude (SubPath mount), others use /workspace SYNC_PATHS=( - ".claude" "artifacts" "file-uploads" ) +CLAUDE_DATA_PATH="/app/.claude" # Error handler error_exit() { @@ -56,14 +57,15 @@ echo "=========================================" # Create workspace structure echo "Creating workspace structure..." -mkdir -p /workspace/.claude || error_exit "Failed to create .claude directory" +# .claude is mounted at /app/.claude via SubPath (same location as runner container) +mkdir -p "${CLAUDE_DATA_PATH}" || error_exit "Failed to create .claude directory" mkdir -p /workspace/artifacts || error_exit "Failed to create artifacts directory" mkdir -p /workspace/file-uploads || error_exit "Failed to create file-uploads directory" mkdir -p /workspace/repos || error_exit "Failed to create repos directory" # Set permissions on created directories (not root workspace which may be owned by different user) # Use 755 instead of 777 - readable by all, writable only by owner -chmod 755 /workspace/.claude /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true +chmod 755 "${CLAUDE_DATA_PATH}" /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true # Check if S3 is configured if [ -z "${S3_ENDPOINT}" ] || [ -z "${S3_BUCKET}" ] || [ -z "${AWS_ACCESS_KEY_ID}" ] || [ -z "${AWS_SECRET_ACCESS_KEY}" ]; then @@ -90,7 +92,19 @@ echo "Checking for existing session state in S3..." if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/" 2>/dev/null | grep -q .; then echo "Found existing session state, downloading from S3..." - # Download each sync path if it exists + # Download .claude data to /app/.claude (SubPath mount matches runner container) + if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/.claude/" 2>/dev/null | grep -q .; then + echo " Downloading .claude/..." 
+ rclone --config /tmp/.config/rclone/rclone.conf copy "${S3_PATH}/.claude/" "${CLAUDE_DATA_PATH}/" \ + --copy-links \ + --transfers 8 \ + --fast-list \ + --progress 2>&1 || echo " Warning: failed to download .claude" + else + echo " No data for .claude/" + fi + + # Download other sync paths to /workspace for path in "${SYNC_PATHS[@]}"; do if rclone --config /tmp/.config/rclone/rclone.conf lsf "${S3_PATH}/${path}/" 2>/dev/null | grep -q .; then echo " Downloading ${path}/..." @@ -111,7 +125,7 @@ fi # Set permissions on subdirectories (EmptyDir root may not be chmodable) echo "Setting permissions on subdirectories..." -chmod -R 755 /workspace/.claude /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true +chmod -R 755 "${CLAUDE_DATA_PATH}" /workspace/artifacts /workspace/file-uploads /workspace/repos 2>/dev/null || true # ======================================== # Clone repositories and workflows diff --git a/components/runners/state-sync/sync.sh b/components/runners/state-sync/sync.sh index 05498ac5f..401ef30d1 100644 --- a/components/runners/state-sync/sync.sh +++ b/components/runners/state-sync/sync.sh @@ -16,11 +16,12 @@ NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}" SESSION_NAME="${SESSION_NAME//[^a-zA-Z0-9-]/}" # Paths to sync (non-git content) +# Note: .claude uses /app/.claude (SubPath mount), others use /workspace SYNC_PATHS=( - ".claude" "artifacts" "file-uploads" ) +CLAUDE_DATA_PATH="/app/.claude" # Patterns to exclude from sync EXCLUDE_PATTERNS=( @@ -57,6 +58,14 @@ EOF # Check total size before sync check_size() { local total=0 + + # Check .claude directory size (at /app/.claude via SubPath) + if [ -d "${CLAUDE_DATA_PATH}" ]; then + size=$(du -sb "${CLAUDE_DATA_PATH}" 2>/dev/null | cut -f1 || echo 0) + total=$((total + size)) + fi + + # Check other paths in /workspace for path in "${SYNC_PATHS[@]}"; do if [ -d "/workspace/${path}" ]; then size=$(du -sb "/workspace/${path}" 2>/dev/null | cut -f1 || echo 0) @@ -79,6 +88,26 @@ sync_to_s3() { echo "[$(date -Iseconds)] Starting sync to S3..." local synced=0 + + # Sync .claude data from /app/.claude (SubPath mount matches runner container) + if [ -d "${CLAUDE_DATA_PATH}" ]; then + echo " Syncing .claude/..." + if rclone --config /tmp/.config/rclone/rclone.conf sync "${CLAUDE_DATA_PATH}" "${s3_path}/.claude/" \ + --checksum \ + --copy-links \ + --transfers 4 \ + --fast-list \ + --stats-one-line \ + --max-size ${MAX_SYNC_SIZE} \ + $(printf -- '--exclude %s ' "${EXCLUDE_PATTERNS[@]}") \ + 2>&1; then + synced=$((synced + 1)) + else + echo " Warning: sync of .claude had errors" + fi + fi + + # Sync other paths from /workspace for path in "${SYNC_PATHS[@]}"; do if [ -d "/workspace/${path}" ]; then echo " Syncing ${path}/..." From a27a37f3f275217af208c5bc3b57f3d123afdf10 Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 21:26:25 -0600 Subject: [PATCH 3/6] refactor: Clean up session handling and remove deprecated workspace access endpoints - Removed deprecated workspace access endpoints from session routes, streamlining API. - Enhanced session metadata extraction for improved error handling in GetSession. - Updated comments and TODOs in reconciler and session handler files to reflect ongoing migration to controller-runtime patterns. 
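
With the workspace/enable and workspace/touch endpoints removed, session artifacts are read directly from the configured S3 bucket. A minimal sketch of pulling them with rclone, assuming the <bucket>/<namespace>/<session-name> prefix layout that hydrate.sh and sync.sh use (the remote name "s3" is illustrative, not part of this change):

    # list and download the artifacts synced for one session
    rclone lsf s3:<bucket>/<namespace>/<session-name>/artifacts/
    rclone copy s3:<bucket>/<namespace>/<session-name>/artifacts/ ./artifacts/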
--- components/backend/handlers/sessions.go | 31 ++++--------- components/backend/routes.go | 2 - .../operator/internal/handlers/reconciler.go | 9 +++- .../operator/internal/handlers/sessions.go | 44 +++++-------------- 4 files changed, 26 insertions(+), 60 deletions(-) diff --git a/components/backend/handlers/sessions.go b/components/backend/handlers/sessions.go index 6af2681a8..276d0b019 100644 --- a/components/backend/handlers/sessions.go +++ b/components/backend/handlers/sessions.go @@ -748,10 +748,18 @@ func GetSession(c *gin.Context) { return } + // Safely extract metadata using type-safe pattern + metadata, ok := item.Object["metadata"].(map[string]interface{}) + if !ok { + log.Printf("GetSession: invalid metadata for session %s", sessionName) + c.JSON(http.StatusInternalServerError, gin.H{"error": "Invalid session metadata"}) + return + } + session := types.AgenticSession{ APIVersion: item.GetAPIVersion(), Kind: item.GetKind(), - Metadata: item.Object["metadata"].(map[string]interface{}), + Metadata: metadata, } if spec, ok := item.Object["spec"].(map[string]interface{}); ok { @@ -2102,27 +2110,6 @@ func StopSession(c *gin.Context) { c.JSON(http.StatusAccepted, session) } -// EnableWorkspaceAccess is deprecated - temporary content pods have been removed -// POST /api/projects/:projectName/agentic-sessions/:sessionName/workspace/enable -func EnableWorkspaceAccess(c *gin.Context) { - c.JSON(http.StatusGone, gin.H{ - "error": "Temporary workspace access has been removed", - "message": "Session artifacts are now stored in S3. Access artifacts directly from your S3 bucket.", - "hint": "Configure S3 storage in project settings to persist session state and artifacts.", - "s3Path": fmt.Sprintf("s3://{bucket}/{namespace}/%s/", c.Param("sessionName")), - }) -} - -// TouchWorkspaceAccess updates the last-accessed timestamp to keep temp pod alive -// POST /api/projects/:projectName/agentic-sessions/:sessionName/workspace/touch -func TouchWorkspaceAccess(c *gin.Context) { - // Deprecated: Temp-content pods no longer exist - c.JSON(http.StatusGone, gin.H{ - "error": "Temporary workspace access has been removed", - "message": "Session artifacts are stored in S3 and do not require touch/keepalive.", - }) -} - // GetSessionK8sResources returns job, pod, and PVC information for a session // GET /api/projects/:projectName/agentic-sessions/:sessionName/k8s-resources func GetSessionK8sResources(c *gin.Context) { diff --git a/components/backend/routes.go b/components/backend/routes.go index 7e8c95df4..539ca4ea5 100644 --- a/components/backend/routes.go +++ b/components/backend/routes.go @@ -56,8 +56,6 @@ func registerRoutes(r *gin.Engine) { projectGroup.POST("/agentic-sessions/:sessionName/clone", handlers.CloneSession) projectGroup.POST("/agentic-sessions/:sessionName/start", handlers.StartSession) projectGroup.POST("/agentic-sessions/:sessionName/stop", handlers.StopSession) - projectGroup.POST("/agentic-sessions/:sessionName/workspace/enable", handlers.EnableWorkspaceAccess) - projectGroup.POST("/agentic-sessions/:sessionName/workspace/touch", handlers.TouchWorkspaceAccess) projectGroup.GET("/agentic-sessions/:sessionName/workspace", handlers.ListSessionWorkspace) projectGroup.GET("/agentic-sessions/:sessionName/workspace/*path", handlers.GetSessionWorkspaceFile) projectGroup.PUT("/agentic-sessions/:sessionName/workspace/*path", handlers.PutSessionWorkspaceFile) diff --git a/components/operator/internal/handlers/reconciler.go b/components/operator/internal/handlers/reconciler.go index 
f7982932e..a6e079fa9 100644 --- a/components/operator/internal/handlers/reconciler.go +++ b/components/operator/internal/handlers/reconciler.go @@ -10,15 +10,20 @@ import ( "time" corev1 "k8s.io/api/core/v1" - v1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" "ambient-code-operator/internal/config" - "ambient-code-operator/internal/types" ) // ReconcilePendingSession handles the Pending phase - creates pod and services. // This is the main entry point called from the controller for pending sessions. +// +// TODO(controller-runtime-migration): This is a transitional wrapper around the legacy +// handleAgenticSessionEvent() function (2,300+ lines). Future work should: +// 1. Extract phase-specific logic into separate functions (ReconcilePending, ReconcileRunning, etc.) +// 2. Use controller-runtime patterns (Patch, StatusWriter, etc.) instead of direct API calls +// 3. Remove handleAgenticSessionEvent() entirely +// This approach allows adopting controller-runtime framework without rewriting all logic at once. func ReconcilePendingSession(ctx context.Context, session *unstructured.Unstructured, appConfig *config.Config) error { // Delegate to existing handleAgenticSessionEvent logic // This is a wrapper that allows the existing code to be called from the controller diff --git a/components/operator/internal/handlers/sessions.go b/components/operator/internal/handlers/sessions.go index 867a09e68..2fab8f408 100644 --- a/components/operator/internal/handlers/sessions.go +++ b/components/operator/internal/handlers/sessions.go @@ -25,20 +25,25 @@ import ( v1 "k8s.io/apimachinery/pkg/apis/meta/v1" "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured" intstr "k8s.io/apimachinery/pkg/util/intstr" - "k8s.io/apimachinery/pkg/watch" "k8s.io/client-go/util/retry" ) // Track which pods are currently being monitored to prevent duplicate goroutines -// NOTE: This is used by the legacy handleAgenticSessionEvent function which is -// kept for reference but no longer actively called by the operator. -// The controller-runtime based reconciler in internal/controller/ handles all -// AgenticSession reconciliation now. var ( monitoredPods = make(map[string]bool) monitoredPodsMu sync.Mutex ) +// handleAgenticSessionEvent is the legacy reconciliation function containing all session +// lifecycle logic (~2,300 lines). It's called by ReconcilePendingSession() wrapper. +// +// TODO(controller-runtime-migration): This function should be refactored into smaller, +// phase-specific reconcilers that use controller-runtime patterns. Current architecture: +// - ✅ Controller-runtime framework adopted (work queue, leader election, metrics) +// - ⚠️ Business logic still uses legacy patterns (direct API calls, manual status updates) +// - 🔜 Future: Break into ReconcilePending, ReconcileRunning, ReconcileStopped functions +// +// This transitional approach allows framework adoption without rewriting 2,300 lines at once. 
func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { name := obj.GetName() sessionNamespace := obj.GetNamespace() @@ -681,35 +686,6 @@ func handleAgenticSessionEvent(obj *unstructured.Unstructured) error { }) } - // Extract repos configuration (simplified format: url and branch) - type RepoConfig struct { - URL string - Branch string - } - - var repos []RepoConfig - - // Read repos[] array format - if reposArr, found, _ := unstructured.NestedSlice(spec, "repos"); found && len(reposArr) > 0 { - repos = make([]RepoConfig, 0, len(reposArr)) - for _, repoItem := range reposArr { - if repoMap, ok := repoItem.(map[string]interface{}); ok { - repo := RepoConfig{} - if url, ok := repoMap["url"].(string); ok { - repo.URL = url - } - if branch, ok := repoMap["branch"].(string); ok { - repo.Branch = branch - } else { - repo.Branch = "main" - } - if repo.URL != "" { - repos = append(repos, repo) - } - } - } - } - // Read autoPushOnComplete flag autoPushOnComplete, _, _ := unstructured.NestedBool(spec, "autoPushOnComplete") From 512b7ec9b19cec296b676029bb01441b261e2872 Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 22:11:28 -0600 Subject: [PATCH 4/6] refactor: Clean up code formatting and improve readability - Removed unnecessary blank lines in agenticsession_controller.go and reconcile_phases.go for better code clarity. - Standardized the formatting of metric variable declarations in otel_metrics.go to enhance consistency across the file. --- .../internal/controller/agenticsession_controller.go | 1 - .../operator/internal/controller/otel_metrics.go | 10 +++++----- .../operator/internal/controller/reconcile_phases.go | 12 +++++------- 3 files changed, 10 insertions(+), 13 deletions(-) diff --git a/components/operator/internal/controller/agenticsession_controller.go b/components/operator/internal/controller/agenticsession_controller.go index f85e2167a..a33c9778f 100644 --- a/components/operator/internal/controller/agenticsession_controller.go +++ b/components/operator/internal/controller/agenticsession_controller.go @@ -281,7 +281,6 @@ func (r *AgenticSessionReconciler) SetupWithManager(mgr ctrl.Manager) error { return nil } - // GetGVR returns the GroupVersionResource for AgenticSession func GetGVR() schema.GroupVersionResource { return optypes.GetAgenticSessionResource() diff --git a/components/operator/internal/controller/otel_metrics.go b/components/operator/internal/controller/otel_metrics.go index d6c4fba0f..a6101118e 100644 --- a/components/operator/internal/controller/otel_metrics.go +++ b/components/operator/internal/controller/otel_metrics.go @@ -38,11 +38,11 @@ var ( sessionsByProject metric.Int64Counter // Error metrics (counters) - reconcileRetries metric.Int64Counter - sessionTimeouts metric.Int64Counter - s3Errors metric.Int64Counter - tokenRefreshErrors metric.Int64Counter - podRestarts metric.Int64Counter + reconcileRetries metric.Int64Counter + sessionTimeouts metric.Int64Counter + s3Errors metric.Int64Counter + tokenRefreshErrors metric.Int64Counter + podRestarts metric.Int64Counter ) // InitMetrics initializes OpenTelemetry metrics diff --git a/components/operator/internal/controller/reconcile_phases.go b/components/operator/internal/controller/reconcile_phases.go index 3dbe4db62..082b53772 100644 --- a/components/operator/internal/controller/reconcile_phases.go +++ b/components/operator/internal/controller/reconcile_phases.go @@ -91,11 +91,11 @@ func recordImagePullDuration(namespace string, pod *corev1.Pod) { // Check all containers for image 
pull timing for _, cs := range pod.Status.ContainerStatuses { - if cs.State.Running != nil && cs.State.Running.StartedAt.Time.After(podCreated) { + if cs.State.Running != nil && cs.State.Running.StartedAt.After(podCreated) { // Approximate image pull duration as time from pod creation to container start // This includes scheduling + image pull + container creation - duration := cs.State.Running.StartedAt.Time.Sub(podCreated).Seconds() - + duration := cs.State.Running.StartedAt.Sub(podCreated).Seconds() + // Extract image name (remove tag/digest for cleaner metrics) image := cs.Image if idx := strings.Index(image, "@"); idx != -1 { @@ -103,9 +103,9 @@ func recordImagePullDuration(namespace string, pod *corev1.Pod) { } else if idx := strings.LastIndex(image, ":"); idx != -1 { image = image[:idx] } - + RecordImagePullDuration(namespace, image, duration) - + // Log for first container only (usually the runner) log.Log.Info("Image pull completed", "namespace", namespace, @@ -147,7 +147,6 @@ func recordStartupTime(namespace, sessionName string, session *unstructured.Unst ) } - // reconcilePending handles sessions in Pending phase. // This creates the runner pod and transitions to Creating phase. func (r *AgenticSessionReconciler) reconcilePending(ctx context.Context, session *unstructured.Unstructured) (ctrl.Result, error) { @@ -379,4 +378,3 @@ func (r *AgenticSessionReconciler) reconcileStopping(ctx context.Context, sessio // Requeue to check again return ctrl.Result{RequeueAfter: 2 * time.Second}, nil } - From 54b33821f3d9046c00af0aa4346844cee7f3579c Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 23:46:14 -0600 Subject: [PATCH 5/6] refactor: Improve session detail and message handling - Updated repository path handling in ProjectSessionDetailPage to ensure consistency in workspace structure. - Enhanced conditional display logic for the welcome experience based on session status, improving user interaction. - Refined chat interface visibility logic in MessagesTab to only show when the session is in the Running state, clarifying user expectations. - Adjusted dropdown menu visibility to only appear when there are stream messages, streamlining the UI. 
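
The repos/ prefix added to the repo path options mirrors the on-disk layout created by hydrate.sh and clone_repo_at_runtime(). A quick check from inside the runner pod (paths assumed from hydrate.sh, shown only to illustrate the layout the UI now points at):

    ls /workspace/repos/        # one directory per cloned repository
    ls /workspace/artifacts/    # artifacts directory that sync.sh pushes to S3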
--- .../[name]/sessions/[sessionName]/page.tsx | 5 ++- .../src/components/session/MessagesTab.tsx | 39 ++++++++++--------- 2 files changed, 24 insertions(+), 20 deletions(-) diff --git a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx index 987230788..2617325b9 100644 --- a/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx +++ b/components/frontend/src/app/projects/[name]/sessions/[sessionName]/page.tsx @@ -624,10 +624,11 @@ export default function ProjectSessionDetailPage({ if (session?.spec?.repos) { session.spec.repos.forEach((repo, idx) => { const repoName = repo.url.split('/').pop()?.replace('.git', '') || `repo-${idx}`; + // Repos are cloned to /workspace/repos/{name} options.push({ type: "repo", name: repoName, - path: repoName, + path: `repos/${repoName}`, }); }); } @@ -1905,7 +1906,7 @@ export default function ProjectSessionDetailPage({ workflowMetadata={workflowMetadata} onCommandClick={handleCommandClick} isRunActive={isRunActive} - showWelcomeExperience={true} + showWelcomeExperience={!["Completed", "Failed", "Stopped", "Stopping"].includes(session?.status?.phase || "")} activeWorkflow={workflowManagement.activeWorkflow} userHasInteracted={userHasInteracted} queuedMessages={sessionQueue.messages} diff --git a/components/frontend/src/components/session/MessagesTab.tsx b/components/frontend/src/components/session/MessagesTab.tsx index 3e45d60b6..e37891e8a 100644 --- a/components/frontend/src/components/session/MessagesTab.tsx +++ b/components/frontend/src/components/session/MessagesTab.tsx @@ -63,8 +63,9 @@ const MessagesTab: React.FC = ({ session, streamMessages, chat const phase = session?.status?.phase || ""; const isInteractive = session?.spec?.interactive; - // Show chat interface when session is interactive AND (in Running state OR showing welcome experience) - const showChatInterface = isInteractive && (phase === "Running" || showWelcomeExperience); + // Show chat interface only when session is interactive AND Running + // Welcome experience can be shown during Pending/Creating, but chat input only when Running + const showChatInterface = isInteractive && phase === "Running"; // Determine if session is in a terminal state const isTerminalState = ["Completed", "Failed", "Stopped"].includes(phase); @@ -713,26 +714,28 @@ const MessagesTab: React.FC = ({ session, streamMessages, chat
)} - {isInteractive && !showChatInterface && streamMessages.length > 0 && ( + {isInteractive && !showChatInterface && (streamMessages.length > 0 || isCreating || isTerminalState) && (
- - - - - - - Show system messages - - - + {streamMessages.length > 0 && ( + + + + + + + Show system messages + + + + )}

{isCreating && "Chat will be available once the session is running..."} {isTerminalState && ( From 9f34b80a6b1a91a2a4c38fefd44abfedf175c001 Mon Sep 17 00:00:00 2001 From: Gage Krumbach Date: Mon, 5 Jan 2026 23:54:11 -0600 Subject: [PATCH 6/6] feat: Add state-sync component and observability stack deployment - Introduced a new state-sync component in the build and deploy workflows, enhancing the deployment process. - Added steps to deploy the observability stack in both components-build-deploy and prod-release-deploy workflows. - Updated kustomization to include the state-sync image for consistent image tagging across environments. - Enhanced environment variable settings to include the state-sync image in deployment configurations. --- .github/workflows/components-build-deploy.yml | 17 +++++++++++++++-- .github/workflows/prod-release-deploy.yaml | 12 +++++++++++- 2 files changed, 26 insertions(+), 3 deletions(-) diff --git a/.github/workflows/components-build-deploy.yml b/.github/workflows/components-build-deploy.yml index b3ea8bf6f..5bf6b5a88 100644 --- a/.github/workflows/components-build-deploy.yml +++ b/.github/workflows/components-build-deploy.yml @@ -84,6 +84,11 @@ jobs: image: quay.io/ambient_code/vteam_claude_runner dockerfile: ./components/runners/claude-code-runner/Dockerfile changed: ${{ needs.detect-changes.outputs.claude-runner }} + - name: state-sync + context: ./components/runners + image: quay.io/ambient_code/vteam_state_sync + dockerfile: ./components/runners/state-sync/Dockerfile + changed: ${{ needs.detect-changes.outputs.claude-runner }} steps: - name: Checkout code if: matrix.component.changed == 'true' || github.event_name == 'workflow_dispatch' @@ -163,6 +168,10 @@ jobs: oc apply -k components/manifests/base/rbac/ oc apply -f components/manifests/overlays/production/operator-config-openshift.yaml -n ambient-code + - name: Deploy observability stack + run: | + oc apply -k components/manifests/observability/ + deploy-to-openshift: runs-on: ubuntu-latest needs: [detect-changes, build-and-push, update-rbac-and-crd] @@ -220,6 +229,7 @@ jobs: kustomize edit set image quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:${{ steps.image-tags.outputs.backend_tag }} kustomize edit set image quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:${{ steps.image-tags.outputs.operator_tag }} kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:${{ steps.image-tags.outputs.runner_tag }} + kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:${{ steps.image-tags.outputs.runner_tag }} - name: Validate kustomization working-directory: components/manifests/overlays/production @@ -250,7 +260,8 @@ jobs: run: | oc set env deployment/agentic-operator -n ambient-code -c agentic-operator \ AMBIENT_CODE_RUNNER_IMAGE="quay.io/ambient_code/vteam_claude_runner:${{ steps.image-tags.outputs.runner_tag }}" \ - CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:${{ steps.image-tags.outputs.backend_tag }}" + CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:${{ steps.image-tags.outputs.backend_tag }}" \ + STATE_SYNC_IMAGE="quay.io/ambient_code/vteam_state_sync:${{ steps.image-tags.outputs.runner_tag }}" deploy-with-disptach: runs-on: ubuntu-latest @@ -282,6 +293,7 @@ jobs: kustomize edit set image quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:stage kustomize edit set image 
quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:stage kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:stage + kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:stage - name: Validate kustomization working-directory: components/manifests/overlays/production @@ -309,4 +321,5 @@ jobs: run: | oc set env deployment/agentic-operator -n ambient-code -c agentic-operator \ AMBIENT_CODE_RUNNER_IMAGE="quay.io/ambient_code/vteam_claude_runner:stage" \ - CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:stage" + CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:stage" \ + STATE_SYNC_IMAGE="quay.io/ambient_code/vteam_state_sync:stage" diff --git a/.github/workflows/prod-release-deploy.yaml b/.github/workflows/prod-release-deploy.yaml index fc4f198f4..27b644355 100644 --- a/.github/workflows/prod-release-deploy.yaml +++ b/.github/workflows/prod-release-deploy.yaml @@ -158,6 +158,10 @@ jobs: context: ./components/runners image: quay.io/ambient_code/vteam_claude_runner dockerfile: ./components/runners/claude-code-runner/Dockerfile + - name: state-sync + context: ./components/runners + image: quay.io/ambient_code/vteam_state_sync + dockerfile: ./components/runners/state-sync/Dockerfile steps: - name: Checkout code from the tag generated above uses: actions/checkout@v5 @@ -221,6 +225,10 @@ jobs: run: | oc login ${{ secrets.PROD_OPENSHIFT_SERVER }} --token=${{ secrets.PROD_OPENSHIFT_TOKEN }} --insecure-skip-tls-verify + - name: Deploy observability stack + run: | + oc apply -k components/manifests/observability/ + - name: Update kustomization with release image tags working-directory: components/manifests/overlays/production run: | @@ -229,6 +237,7 @@ jobs: kustomize edit set image quay.io/ambient_code/vteam_backend:latest=quay.io/ambient_code/vteam_backend:${RELEASE_TAG} kustomize edit set image quay.io/ambient_code/vteam_operator:latest=quay.io/ambient_code/vteam_operator:${RELEASE_TAG} kustomize edit set image quay.io/ambient_code/vteam_claude_runner:latest=quay.io/ambient_code/vteam_claude_runner:${RELEASE_TAG} + kustomize edit set image quay.io/ambient_code/vteam_state_sync:latest=quay.io/ambient_code/vteam_state_sync:${RELEASE_TAG} - name: Validate kustomization working-directory: components/manifests/overlays/production @@ -256,4 +265,5 @@ jobs: run: | oc set env deployment/agentic-operator -n ambient-code -c agentic-operator \ AMBIENT_CODE_RUNNER_IMAGE="quay.io/ambient_code/vteam_claude_runner:${{ needs.release.outputs.new_tag }}" \ - CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:${{ needs.release.outputs.new_tag }}" + CONTENT_SERVICE_IMAGE="quay.io/ambient_code/vteam_backend:${{ needs.release.outputs.new_tag }}" \ + STATE_SYNC_IMAGE="quay.io/ambient_code/vteam_state_sync:${{ needs.release.outputs.new_tag }}"
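
Once the deploy job has run, the operator deployment should carry all three runtime image variables. A post-deploy sanity check, mirroring the oc set env calls above (the grep filter is just an illustration):

    oc set env deployment/agentic-operator -n ambient-code -c agentic-operator --list \
      | grep -E 'AMBIENT_CODE_RUNNER_IMAGE|CONTENT_SERVICE_IMAGE|STATE_SYNC_IMAGE'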