🤖 ci: deflake MCP screenshot integration test (#1173)

ThomasK33 · web-flow · commit 9aee49bdae77 · 2025-12-15T16:02:57.000Z
Deflake MCP Chrome screenshot integration test.

Changes:
- Stop leaking MCP/Chrome processes by ensuring `setupWorkspace()` tears
down the workspace via `workspace.remove(...)` and disposes services
during test cleanup.
- Make the screenshot assertions deterministic by forcing a
`chrome_take_screenshot` tool call via `toolPolicy: require` (no longer
depends on the model deciding to use tools).
- Reduce CI variance by pinning `chrome-devtools-mcp` and using a fixed
viewport; run PNG/JPEG cases sequentially; increase the
post-`stream-end` tool-call wait.

Validation:
- `make static-check`

---

&lt;details&gt;
&lt;summary&gt;📋 Implementation Plan&lt;/summary&gt;

# Deflake: `tests/ipc/mcpConfig.test.ts` MCP screenshot test

## What’s failing
- **Test:** `MCP server integration with model › MCP PNG image content
is correctly transformed to AI SDK format`
- **Failure:** `waitForToolCallEnd(…, "chrome_take_screenshot", …)`
returns `undefined` → no matching `tool-call-end` event.

## Likely root causes (ranked)
1. **Model nondeterminism:** the prompt uses a well-known page
(`example.com`), so the model can sometimes answer from prior knowledge
and skip `chrome_take_screenshot` entirely.
2. **Leaked MCP server processes between tests:** `setupWorkspace()` (in
`tests/ipc/setup.ts`) never calls `workspace.remove`, so MCP servers
started during a test can keep running. That matches the suite’s “Force
exiting Jest…” warning and can cause resource contention / sporadic MCP
startup failures.
3. **Timing/resource contention:** the test runs PNG+JPEG cases
concurrently and starts headless Chrome via `npx`; on slower CI hosts,
tool execution and event delivery may exceed the current 20s polling
window.

<details>
<summary>🔎 Evidence in repo</summary>

- `tests/ipc/setup.ts::setupWorkspace()` cleanup only deletes temp dirs;
it does **not** call `env.orpc.workspace.remove({ workspaceId })`.
- `WorkspaceService.remove()` explicitly stops MCP servers via
`mcpServerManager.stopServers(workspaceId)`.
- `mcpConfig.test.ts` depends on the model choosing to call
`chrome_take_screenshot` (not enforced).
</details>

---

## Recommended approach (A): Keep the integration test, but make it deterministic
**Net new product LoC:** ~0 (test/harness only)

### A1) Fix cleanup so MCP servers don’t leak across tests
1. Update `tests/ipc/setup.ts::setupWorkspace()`’s `cleanup()` to:
   - `await env.orpc.workspace.remove({ workspaceId }).catch(() => {})`
     (must run **before** deleting `env.tempDir`)
   - `await env.services.dispose()` (clears MCP idle interval + terminates
     background procs)
   - then run existing `cleanupTestEnvironment(env)` +
     `cleanupTempGitRepo(tempGitRepo)`

This should eliminate orphaned Chrome/MCP processes and reduce CI flake
across the whole integration suite.
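
A minimal sketch of the teardown ordering described in A1. The `TestEnv` and `CleanupDeps` shapes and the `makeCleanup` helper are hypothetical stand-ins for illustration; only the call names (`workspace.remove`, `services.dispose`, `cleanupTestEnvironment`, `cleanupTempGitRepo`) come from the plan, and their real signatures may differ.

```typescript
// Hypothetical shapes: only the method names mirror the plan above.
interface TestEnv {
  orpc: { workspace: { remove(args: { workspaceId: string }): Promise<unknown> } };
  services: { dispose(): Promise<void> };
}

interface CleanupDeps {
  cleanupTestEnvironment(env: TestEnv): Promise<void>;
  cleanupTempGitRepo(path: string): Promise<void>;
}

function makeCleanup(
  env: TestEnv,
  workspaceId: string,
  tempGitRepo: string,
  deps: CleanupDeps
): () => Promise<void> {
  return async () => {
    // 1. Remove the workspace first so MCP servers are stopped while the
    //    temp directory still exists. Best-effort: a failed remove must not
    //    block the remaining steps.
    await env.orpc.workspace.remove({ workspaceId }).catch(() => {});
    // 2. Dispose services (idle interval, background processes).
    await env.services.dispose();
    // 3. Only now delete the temp directories.
    await deps.cleanupTestEnvironment(env);
    await deps.cleanupTempGitRepo(tempGitRepo);
  };
}
```

Swallowing the `remove` error keeps cleanup idempotent when a test has already removed its own workspace.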

### A2) Stop relying on the model “choosing” to call screenshot tools
Modify `mcpConfig.test.ts` so the test asserts the transformation path
without depending on free-form model behavior.

Concrete options (pick one):

**Option 1 (preferred): force the tool call using `toolPolicy: require`
and don’t assert the description**
- Send a minimal prompt like:
  - PNG: “Call `chrome_take_screenshot` now.”
  - JPEG: “Call `chrome_take_screenshot` with format "jpeg".”
- Pass `options.toolPolicy = [{ regex_match: "chrome_take_screenshot", action: "require" }]`.
- Only assert:
  - a `tool-call-end` event exists for `chrome_take_screenshot`
  - `assertValidScreenshotResult(…, mediaTypePattern)` passes
- Drop (or relax) `assertModelDescribesScreenshot()`; it adds LLM-output
flake and isn’t needed to validate the MCP→AI-SDK media transformation.

**Option 2: split into two required calls (navigate then screenshot)**
- Message 1: require `chrome_navigate_page` and instruct URL.
- Message 2: require `chrome_take_screenshot`.
- This is useful only if we still want to validate “example.com”
specifically; otherwise it’s extra moving parts.
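
A rough sketch of how Option 1's policy could translate into a forced tool call. The condition mirrors the `streamManager.ts` change in this commit (a `require` filter plus exactly one remaining tool). The simplified types and the `requireTool` / `pickToolChoice` helpers are hypothetical and exist only for illustration.

```typescript
// Hypothetical, simplified types; the repo's real definitions may differ.
type ToolPolicyFilter = { regex_match: string; action: string };
type ToolChoice = { type: "tool"; toolName: string } | "required" | undefined;

// Build the policy the plan proposes for a single required tool.
function requireTool(toolName: string): ToolPolicyFilter[] {
  return [{ regex_match: toolName, action: "require" }];
}

// Force a specific tool only when a "require" filter is present and exactly
// one tool is left after policy filtering; otherwise leave the choice unset.
function pickToolChoice(
  tools: Record<string, unknown> | undefined,
  toolPolicy: ToolPolicyFilter[] | undefined
): ToolChoice {
  if (!tools || !toolPolicy) return undefined;
  const hasRequireAction = toolPolicy.some((filter) => filter.action === "require");
  const names = Object.keys(tools);
  const only = names[0];
  if (hasRequireAction && names.length === 1 && only !== undefined) {
    return { type: "tool", toolName: only };
  }
  return undefined;
}
```

With the tool forced this way, the test no longer depends on the prompt wording to trigger the screenshot call.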

### A3) Reduce environment-driven variance
- Pin the MCP server version: replace `chrome-devtools-mcp@latest` with
the currently observed version (`chrome-devtools-mcp@0.12.1`).
- Add a deterministic viewport (smaller = faster + avoids huge PNGs):
`--viewport 1280x720`.
- If CI still flakes, run PNG/JPEG sequentially (remove
`test.concurrent.each`).
- Increase `waitForToolCallEnd` timeout from 20s → 60s (CI headless
Chrome can be slow).
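
For the timeout bump, a polling wait along these lines is one possibility. `pollForToolCallEnd`, the `ChatEvent` shape, and the `getEvents` accessor are hypothetical; the real `waitForToolCallEnd` helper may be structured differently. Like the real helper, this sketch returns `undefined` on timeout.

```typescript
// Hypothetical event shape for illustration only.
type ChatEvent = { type: string; toolName?: string };

async function pollForToolCallEnd(
  getEvents: () => readonly ChatEvent[],
  toolName: string,
  timeoutMs = 60_000, // raised from 20s for slow CI hosts
  intervalMs = 100
): Promise<ChatEvent | undefined> {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    // Re-read the event list on every iteration; events arrive asynchronously.
    const hit = getEvents().find(
      (event) => event.type === "tool-call-end" && event.toolName === toolName
    );
    if (hit) return hit;
    if (Date.now() >= deadline) return undefined;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Checking for a match before checking the deadline ensures an event that lands on the final poll is still observed.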

---

## Alternative approach (B): Move correctness to unit tests; keep only a small integration smoke test
**Net new product LoC:** ~0

1. Add a unit test suite for `src/node/services/mcpResultTransform.ts`:
   - converts MCP `{ content: [{type:"image", data, mimeType}] }` →
     `{ type:"content", value:[{type:"media", …}] }`
   - preserves `mimeType` → `mediaType`
   - validates the size guard behavior (`MAX_IMAGE_DATA_BYTES`)
     deterministically
2. Replace the flaky Chrome+model integration assertion with:
   - existing `memory_create_entities` MCP integration test (already
     present)
   - optional: a chrome MCP “tools available” test (no screenshot, no
     model)

Use this if we decide that a full Chrome+LLM flow is too expensive/flaky
for CI.
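
To make the unit-test target concrete, here is a sketch of the mapping such a suite would exercise. Everything here is an assumption: the real `mcpResultTransform.ts` is not shown in this change, so the `transformMcpResultSketch` name, the limit value, and the "replace oversized images with a text note" behavior are illustrative guesses, not the module's actual API.

```typescript
// Assumed limit; the real MAX_IMAGE_DATA_BYTES value is not shown in this change.
const ASSUMED_MAX_IMAGE_DATA_BYTES = 8 * 1024 * 1024;

type McpContent =
  | { type: "image"; data: string; mimeType: string }
  | { type: "text"; text: string };

type SdkPart =
  | { type: "media"; data: string; mediaType: string }
  | { type: "text"; text: string };

function transformMcpResultSketch(
  result: { content: McpContent[] },
  maxBytes: number = ASSUMED_MAX_IMAGE_DATA_BYTES
): { type: "content"; value: SdkPart[] } {
  const value = result.content.map((item): SdkPart => {
    if (item.type !== "image") return item;
    // Assumption: the guard compares the base64 string length to the limit
    // and swaps an oversized image for a short text note.
    if (item.data.length > maxBytes) {
      return { type: "text", text: `[image omitted: ${item.data.length} base64 chars]` };
    }
    // mimeType (MCP) becomes mediaType (AI SDK); the payload is unchanged.
    return { type: "media", data: item.data, mediaType: item.mimeType };
  });
  return { type: "content", value };
}
```

Because the function is pure, both the happy path and the size guard can be tested without Chrome or a model.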

---

## Optional product hardening (nice-to-have)
**Net new product LoC:** ~20–60

- Consider making `MCPServerManager.dispose()` stop all running
workspace servers (not just clear the idle interval). This would harden
app shutdown behavior and prevent long-lived processes in any non-test
embedding.
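
A sketch of what the hardened `dispose()` might look like. `ServerManagerSketch` is a hypothetical stand-in; the real `MCPServerManager` fields and method signatures are not shown in this change, so treat the names below as assumptions.

```typescript
// Hypothetical stand-in for a running MCP server handle.
interface StoppableServer {
  stop(): Promise<void>;
}

class ServerManagerSketch {
  private readonly servers = new Map<string, StoppableServer>();
  private idleTimer: ReturnType<typeof setInterval> | undefined;

  constructor(idleCheckMs = 60_000) {
    // Placeholder for the idle-server sweep mentioned in the plan.
    this.idleTimer = setInterval(() => {}, idleCheckMs);
  }

  register(workspaceId: string, server: StoppableServer): void {
    this.servers.set(workspaceId, server);
  }

  get runningCount(): number {
    return this.servers.size;
  }

  async stopServers(workspaceId: string): Promise<void> {
    const server = this.servers.get(workspaceId);
    this.servers.delete(workspaceId);
    await server?.stop();
  }

  async dispose(): Promise<void> {
    if (this.idleTimer !== undefined) {
      clearInterval(this.idleTimer);
      this.idleTimer = undefined;
    }
    // Stop every workspace's servers; one failure must not strand the rest.
    const ids = Array.from(this.servers.keys());
    await Promise.all(ids.map((id) => this.stopServers(id).catch(() => {})));
  }
}
```

Stopping servers inside `dispose()` would make shutdown safe even when a caller never invokes `workspace.remove`.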

---

## Validation
- Run the failing test in a loop (CI-like):
  - `TEST_INTEGRATION=1 bun x jest tests/ipc/mcpConfig.test.ts -t "image content" --runInBand`
- Confirm:
  - `tool-call-end` for `chrome_take_screenshot` is always present
  - no lingering Node handles (the “Force exiting Jest…” warning
    disappears or is reduced)

&lt;/details&gt;

---
_Generated with `mux` • Model: `openai:gpt-5.2` • Thinking: `xhigh`_

---------

Signed-off-by: Thomas Kosiewski <tk@coder.com>
diff --git a/src/node/services/agentSession.disposeRace.test.ts b/src/node/services/agentSession.disposeRace.test.ts
@@ -0,0 +1,112 @@
+import { describe, expect, test, mock } from "bun:test";
+import { AgentSession } from "./agentSession";
+import type { Config } from "@/node/config";
+import type { HistoryService } from "./historyService";
+import type { PartialService } from "./partialService";
+import type { AIService } from "./aiService";
+import type { InitStateManager } from "./initStateManager";
+import type { BackgroundProcessManager } from "./backgroundProcessManager";
+import type { Result } from "@/common/types/result";
+import { Ok } from "@/common/types/result";
+
+function createDeferred<T>(): {
+  promise: Promise<T>;
+  resolve: (value: T) => void;
+} {
+  let resolve!: (value: T) => void;
+  const promise = new Promise<T>((res) => {
+    resolve = res;
+  });
+  return { promise, resolve };
+}
+
+describe("AgentSession disposal race conditions", () => {
+  test("does not crash if disposed while auto-sending a queued message", async () => {
+    const aiHandlers = new Map<string, (...args: unknown[]) => void>();
+
+    const streamMessage = mock(() => Promise.resolve(Ok(undefined)));
+
+    const aiService: AIService = {
+      on(eventName: string | symbol, listener: (...args: unknown[]) => void) {
+        aiHandlers.set(String(eventName), listener);
+        return this;
+      },
+      off(_eventName: string | symbol, _listener: (...args: unknown[]) => void) {
+        return this;
+      },
+      stopStream: mock(() => Promise.resolve(Ok(undefined))),
+      isStreaming: mock(() => false),
+      streamMessage,
+    } as unknown as AIService;
+
+    const appendDeferred = createDeferred<Result<void>>();
+    const historyService: HistoryService = {
+      appendToHistory: mock(() => appendDeferred.promise),
+    } as unknown as HistoryService;
+
+    const initStateManager: InitStateManager = {
+      on(_eventName: string | symbol, _listener: (...args: unknown[]) => void) {
+        return this;
+      },
+      off(_eventName: string | symbol, _listener: (...args: unknown[]) => void) {
+        return this;
+      },
+    } as unknown as InitStateManager;
+
+    const backgroundProcessManager: BackgroundProcessManager = {
+      cleanup: mock(() => Promise.resolve()),
+      setMessageQueued: mock(() => undefined),
+    } as unknown as BackgroundProcessManager;
+
+    const config: Config = { srcDir: "/tmp" } as unknown as Config;
+    const partialService: PartialService = {} as unknown as PartialService;
+
+    const session = new AgentSession({
+      workspaceId: "ws",
+      config,
+      historyService,
+      partialService,
+      aiService,
+      initStateManager,
+      backgroundProcessManager,
+    });
+
+    // Capture the fire-and-forget sendMessage() promise that sendQueuedMessages() creates.
+    const originalSendMessage = session.sendMessage.bind(session);
+    let inFlight: Promise<unknown> | undefined;
+    (session as unknown as { sendMessage: typeof originalSendMessage }).sendMessage = (
+      ...args: Parameters<typeof originalSendMessage>
+    ) => {
+      const promise = originalSendMessage(...args);
+      inFlight = promise;
+      return promise;
+    };
+
+    session.queueMessage("Queued message", { model: "anthropic:claude-sonnet-4-5" });
+    session.sendQueuedMessages();
+
+    expect(inFlight).toBeDefined();
+
+    // Dispose while sendMessage() is awaiting appendToHistory.
+    session.dispose();
+    appendDeferred.resolve(Ok(undefined));
+
+    const result = await (inFlight as Promise<Result<void>>);
+    expect(result.success).toBe(true);
+
+    // We should not attempt to stream once disposal has begun.
+    expect(streamMessage).toHaveBeenCalledTimes(0);
+
+    // Sanity: invoking a forwarded handler after dispose should be a no-op.
+    const streamStart = aiHandlers.get("stream-start");
+    expect(() =>
+      streamStart?.({
+        type: "stream-start",
+        workspaceId: "ws",
+        messageId: "m1",
+        model: "anthropic:claude-sonnet-4-5",
+        timestamp: Date.now(),
+      })
+    ).not.toThrow();
+  });
+});
diff --git a/src/node/services/agentSession.ts b/src/node/services/agentSession.ts
@@ -453,13 +453,23 @@ export class AgentSession {
       return Err(createUnknownSendMessageError(appendResult.error));
     }
 
+    // Workspace may be tearing down while we await filesystem IO.
+    // If so, skip event emission + streaming to avoid races with dispose().
+    if (this.disposed) {
+      return Ok(undefined);
+    }
+
     // Add type: "message" for discriminated union (createMuxMessage doesn't add it)
     this.emitChatEvent({ ...userMessage, type: "message" });
 
     // If this is a compaction request, terminate background processes first
     // They won't be included in the summary, so continuing with orphaned processes would be confusing
     if (isCompactionRequestMetadata(typedMuxMetadata)) {
       await this.backgroundProcessManager.cleanup(this.workspaceId);
+
+      if (this.disposed) {
+        return Ok(undefined);
+      }
     }
 
     // If this is a compaction request with a continue message, queue it for auto-send after compaction
@@ -501,6 +511,10 @@ export class AgentSession {
       this.emitQueuedMessageChanged();
     }
 
+    if (this.disposed) {
+      return Ok(undefined);
+    }
+
     return this.streamWithHistory(options.model, options);
   }
 
@@ -550,6 +564,10 @@ export class AgentSession {
     modelString: string,
     options?: SendMessageOptions
   ): Promise<Result<void, SendMessageError>> {
+    if (this.disposed) {
+      return Ok(undefined);
+    }
+
     const commitResult = await this.partialService.commitToHistory(this.workspaceId);
     if (!commitResult.success) {
       return Err(createUnknownSendMessageError(commitResult.error));
@@ -716,7 +734,12 @@ export class AgentSession {
 
   // Public method to emit chat events (used by init hooks and other workspace events)
   emitChatEvent(message: WorkspaceChatMessage): void {
-    this.assertNotDisposed("emitChatEvent");
+    // NOTE: Workspace teardown does not await in-flight async work (sendMessage(), stopStream(), etc).
+    // Those code paths can still try to emit events after dispose; drop them rather than crashing.
+    if (this.disposed) {
+      return;
+    }
+
     this.emitter.emit("chat-event", {
       workspaceId: this.workspaceId,
       message,
@@ -775,6 +798,13 @@ export class AgentSession {
    * Called when tool execution completes, stream ends, or user clicks send immediately.
    */
   sendQueuedMessages(): void {
+    // sendQueuedMessages can race with teardown (e.g. workspace.remove) because we
+    // trigger it off stream/tool events and disposal does not await stopStream().
+    // If the session is already disposed, do nothing.
+    if (this.disposed) {
+      return;
+    }
+
     // Clear the queued message flag (even if queue is empty, to handle race conditions)
     this.backgroundProcessManager.setMessageQueued(this.workspaceId, false);
 
diff --git a/src/node/services/streamManager.ts b/src/node/services/streamManager.ts
@@ -622,17 +622,17 @@ export class StreamManager extends EventEmitter {
       abortSignal.addEventListener("abort", () => abortController.abort());
     }
 
-    // Determine toolChoice based on toolPolicy
+    // Determine toolChoice based on toolPolicy.
+    //
     // If a tool is required (tools object has exactly one tool after applyToolPolicy),
-    // force the model to use it with toolChoice: { type: "required", toolName: "..." }
-    let toolChoice: { type: "required"; toolName: string } | undefined;
+    // force the model to use it using the AI SDK tool choice shape.
+    let toolChoice: { type: "tool"; toolName: string } | "required" | undefined;
     if (tools && toolPolicy) {
-      // Check if any filter has "require" action
       const hasRequireAction = toolPolicy.some((filter) => filter.action === "require");
       if (hasRequireAction && Object.keys(tools).length === 1) {
         const requiredToolName = Object.keys(tools)[0];
-        toolChoice = { type: "required", toolName: requiredToolName };
-        log.debug("Setting toolChoice to required", { toolName: requiredToolName });
+        toolChoice = { type: "tool", toolName: requiredToolName };
+        log.debug("Setting toolChoice to tool", { toolName: requiredToolName });
       }
     }
 
diff --git a/tests/ipc/fixtures/mcp-screenshot-server.js b/tests/ipc/fixtures/mcp-screenshot-server.js
@@ -0,0 +1,142 @@
+// Minimal MCP server used by integration tests.
+//
+// Intentionally tiny + dependency-free: it speaks JSON-RPC over stdio
+// (newline-delimited JSON) and exposes a single screenshot tool.
+//
+// This lets us test the MCP → AI SDK image transformation without relying on
+// launching a real browser in CI.
+
+const readline = require("readline");
+
+/**
+ * Write a JSON-RPC message to stdout.
+ *
+ * NOTE: @ai-sdk/mcp stdio transport uses newline-delimited JSON.
+ */
+function send(message) {
+  process.stdout.write(`${JSON.stringify(message)}\n`);
+}
+
+const SERVER_INFO = { name: "mux-test-screenshot-mcp", version: "0.0.0" };
+
+const TOOLS = [
+  {
+    name: "take_screenshot",
+    description: "Return a deterministic screenshot image payload (base64) for tests.",
+    inputSchema: {
+      type: "object",
+      properties: {
+        format: {
+          type: "string",
+          enum: ["png", "jpeg"],
+          description: "Image format",
+        },
+      },
+      additionalProperties: true,
+    },
+  },
+];
+
+const rl = readline.createInterface({ input: process.stdin, crlfDelay: Infinity });
+
+rl.on("line", (line) => {
+  const trimmed = line.trim();
+  if (trimmed.length === 0) return;
+
+  let message;
+  try {
+    message = JSON.parse(trimmed);
+  } catch {
+    return;
+  }
+
+  if (message?.jsonrpc !== "2.0") return;
+
+  // Notifications have no id; ignore.
+  if (message.id === undefined) {
+    return;
+  }
+
+  const id = message.id;
+
+  try {
+    switch (message.method) {
+      case "initialize": {
+        const protocolVersion = message.params?.protocolVersion ?? "2024-11-05";
+        send({
+          jsonrpc: "2.0",
+          id,
+          result: {
+            protocolVersion,
+            capabilities: { tools: {} },
+            serverInfo: SERVER_INFO,
+          },
+        });
+        return;
+      }
+
+      case "tools/list": {
+        send({ jsonrpc: "2.0", id, result: { tools: TOOLS } });
+        return;
+      }
+
+      case "tools/call": {
+        const toolName = message.params?.name;
+        if (toolName !== "take_screenshot") {
+          send({
+            jsonrpc: "2.0",
+            id,
+            error: { code: -32601, message: `Unknown tool: ${toolName}` },
+          });
+          return;
+        }
+
+        const format = message.params?.arguments?.format;
+        const mimeType = format === "jpeg" ? "image/jpeg" : "image/png";
+
+        // Produce a deterministic payload large enough for tests (>1000 chars base64).
+        const fillByte = mimeType === "image/jpeg" ? 0x22 : 0x11;
+        const data = Buffer.alloc(2048, fillByte).toString("base64");
+
+        send({
+          jsonrpc: "2.0",
+          id,
+          result: {
+            content: [{ type: "image", data, mimeType }],
+          },
+        });
+        return;
+      }
+
+      default: {
+        send({
+          jsonrpc: "2.0",
+          id,
+          error: { code: -32601, message: `Method not found: ${message.method}` },
+        });
+        return;
+      }
+    }
+  } catch (error) {
+    send({
+      jsonrpc: "2.0",
+      id,
+      error: {
+        code: -32603,
+        message: error instanceof Error ? error.message : String(error),
+      },
+    });
+  }
+});
+
+rl.on("close", () => {
+  process.exit(0);
+});
+
+process.on("SIGTERM", () => {
+  rl.close();
+});
+
+process.on("SIGINT", () => {
+  rl.close();
+});
diff --git a/tests/ipc/mcpConfig.test.ts b/tests/ipc/mcpConfig.test.ts
diff --git a/tests/ipc/setup.ts b/tests/ipc/setup.ts