Skip to content

Commit 9aee49b

Browse files
authored
🤖 ci: deflake MCP screenshot integration test (#1173)
Deflake MCP Chrome screenshot integration test. Changes: - Stop leaking MCP/Chrome processes by ensuring `setupWorkspace()` tears down the workspace via `workspace.remove(...)` and disposes services during test cleanup. - Make the screenshot assertions deterministic by forcing a `chrome_take_screenshot` tool call via `toolPolicy: require` (no longer depends on the model deciding to use tools). - Reduce CI variance by pinning `chrome-devtools-mcp` and using a fixed viewport; run PNG/JPEG cases sequentially; increase the post-`stream-end` tool-call wait. Validation: - `make static-check` --- <details> <summary>📋 Implementation Plan</summary> # Deflake: `tests/ipc/mcpConfig.test.ts` MCP screenshot test ## What’s failing - **Test:** `MCP server integration with model › MCP PNG image content is correctly transformed to AI SDK format` - **Failure:** `waitForToolCallEnd(…, "chrome_take_screenshot", …)` returns `undefined` → no matching `tool-call-end` event. ## Likely root causes (ranked) 1. **Model nondeterminism:** the prompt uses a well-known page (`example.com`), so the model can sometimes answer from prior knowledge and skip `chrome_take_screenshot` entirely. 2. **Leaked MCP server processes between tests:** `setupWorkspace()` (in `tests/ipc/setup.ts`) never calls `workspace.remove`, so MCP servers started during a test can keep running. That matches the suite’s “Force exiting Jest…” warning and can cause resource contention / sporadic MCP startup failures. 3. **Timing/resource contention:** the test runs PNG+JPEG cases concurrently and starts headless Chrome via `npx`; on slower CI hosts, tool execution and event delivery may exceed the current 20s polling window. <details> <summary>🔎 Evidence in repo</summary> - `tests/ipc/setup.ts::setupWorkspace()` cleanup only deletes temp dirs; it does **not** call `env.orpc.workspace.remove({ workspaceId })`. - `WorkspaceService.remove()` explicitly stops MCP servers via `mcpServerManager.stopServers(workspaceId)`. - `mcpConfig.test.ts` depends on the model choosing to call `chrome_take_screenshot` (not enforced). </details> --- ## Recommended approach (A): Keep the integration test, but make it deterministic **Net new product LoC:** ~0 (test/harness only) ### A1) Fix cleanup so MCP servers don’t leak across tests 1. Update `tests/ipc/setup.ts::setupWorkspace()`’s `cleanup()` to: - `await env.orpc.workspace.remove({ workspaceId }).catch(() => {})` (must run **before** deleting `env.tempDir`) - `await env.services.dispose()` (clears MCP idle interval + terminates background procs) - then run existing `cleanupTestEnvironment(env)` + `cleanupTempGitRepo(tempGitRepo)` This should eliminate orphaned Chrome/MCP processes and reduce CI flake across the whole integration suite. ### A2) Stop relying on the model “choosing” to call screenshot tools Modify `mcpConfig.test.ts` so the test asserts the transformation path without depending on free-form model behavior. Concrete options (pick one): **Option 1 (preferred): force the tool call using `toolPolicy: require` and don’t assert the description** - Send a minimal prompt like: - PNG: “Call `chrome_take_screenshot` now.” - JPEG: “Call `chrome_take_screenshot` with format \"jpeg\".” - Pass `options.toolPolicy = [{ regex_match: "chrome_take_screenshot", action: "require" }]`. - Only assert: - a `tool-call-end` event exists for `chrome_take_screenshot` - `assertValidScreenshotResult(…, mediaTypePattern)` passes - Drop (or relax) `assertModelDescribesScreenshot()`; it adds LLM-output flake and isn’t needed to validate the MCP→AI-SDK media transformation. **Option 2: split into two required calls (navigate then screenshot)** - Message 1: require `chrome_navigate_page` and instruct URL. - Message 2: require `chrome_take_screenshot`. - This is useful only if we still want to validate “example.com” specifically; otherwise it’s extra moving parts. ### A3) Reduce environment-driven variance - Pin the MCP server version: replace `chrome-devtools-mcp@latest` with the currently observed version (`chrome-devtools-mcp@0.12.1`). - Add a deterministic viewport (smaller = faster + avoids huge PNGs): `--viewport 1280x720`. - If CI still flakes, run PNG/JPEG sequentially (remove `test.concurrent.each`). - Increase `waitForToolCallEnd` timeout from 20s → 60s (CI headless Chrome can be slow). --- ## Alternative approach (B): Move correctness to unit tests; keep only a small integration smoke test **Net new product LoC:** ~0 1. Add a unit test suite for `src/node/services/mcpResultTransform.ts`: - converts MCP `{ content: [{type:"image", data, mimeType}] }` → `{ type:"content", value:[{type:"media", …}] }` - preserves `mimeType` → `mediaType` - validates the size guard behavior (`MAX_IMAGE_DATA_BYTES`) deterministically 2. Replace the flaky Chrome+model integration assertion with: - existing `memory_create_entities` MCP integration test (already present) - optional: a chrome MCP “tools available” test (no screenshot, no model) Use this if we decide that a full Chrome+LLM flow is too expensive/flaky for CI. --- ## Optional product hardening (nice-to-have) **Net new product LoC:** ~20–60 - Consider making `MCPServerManager.dispose()` stop all running workspace servers (not just clear the idle interval). This would harden app shutdown behavior and prevent long-lived processes in any non-test embedding. --- ## Validation - Run the failing test in a loop (CI-like): - `TEST_INTEGRATION=1 bun x jest tests/ipc/mcpConfig.test.ts -t "image content" --runInBand` - Confirm: - `tool-call-end` for `chrome_take_screenshot` is always present - no lingering Node handles (the “Force exiting Jest…” warning disappears or is reduced) </details> --- _Generated with `mux` • Model: `openai:gpt-5.2` • Thinking: `xhigh`_ --------- Signed-off-by: Thomas Kosiewski <tk@coder.com>
1 parent 42385f0 commit 9aee49b

File tree

6 files changed

+374
-42
lines changed

6 files changed

+374
-42
lines changed
Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
import { describe, expect, test, mock } from "bun:test";
2+
import { AgentSession } from "./agentSession";
3+
import type { Config } from "@/node/config";
4+
import type { HistoryService } from "./historyService";
5+
import type { PartialService } from "./partialService";
6+
import type { AIService } from "./aiService";
7+
import type { InitStateManager } from "./initStateManager";
8+
import type { BackgroundProcessManager } from "./backgroundProcessManager";
9+
import type { Result } from "@/common/types/result";
10+
import { Ok } from "@/common/types/result";
11+
12+
function createDeferred<T>(): {
13+
promise: Promise<T>;
14+
resolve: (value: T) => void;
15+
} {
16+
let resolve!: (value: T) => void;
17+
const promise = new Promise<T>((res) => {
18+
resolve = res;
19+
});
20+
return { promise, resolve };
21+
}
22+
23+
describe("AgentSession disposal race conditions", () => {
24+
test("does not crash if disposed while auto-sending a queued message", async () => {
25+
const aiHandlers = new Map<string, (...args: unknown[]) => void>();
26+
27+
const streamMessage = mock(() => Promise.resolve(Ok(undefined)));
28+
29+
const aiService: AIService = {
30+
on(eventName: string | symbol, listener: (...args: unknown[]) => void) {
31+
aiHandlers.set(String(eventName), listener);
32+
return this;
33+
},
34+
off(_eventName: string | symbol, _listener: (...args: unknown[]) => void) {
35+
return this;
36+
},
37+
stopStream: mock(() => Promise.resolve(Ok(undefined))),
38+
isStreaming: mock(() => false),
39+
streamMessage,
40+
} as unknown as AIService;
41+
42+
const appendDeferred = createDeferred<Result<void>>();
43+
const historyService: HistoryService = {
44+
appendToHistory: mock(() => appendDeferred.promise),
45+
} as unknown as HistoryService;
46+
47+
const initStateManager: InitStateManager = {
48+
on(_eventName: string | symbol, _listener: (...args: unknown[]) => void) {
49+
return this;
50+
},
51+
off(_eventName: string | symbol, _listener: (...args: unknown[]) => void) {
52+
return this;
53+
},
54+
} as unknown as InitStateManager;
55+
56+
const backgroundProcessManager: BackgroundProcessManager = {
57+
cleanup: mock(() => Promise.resolve()),
58+
setMessageQueued: mock(() => undefined),
59+
} as unknown as BackgroundProcessManager;
60+
61+
const config: Config = { srcDir: "/tmp" } as unknown as Config;
62+
const partialService: PartialService = {} as unknown as PartialService;
63+
64+
const session = new AgentSession({
65+
workspaceId: "ws",
66+
config,
67+
historyService,
68+
partialService,
69+
aiService,
70+
initStateManager,
71+
backgroundProcessManager,
72+
});
73+
74+
// Capture the fire-and-forget sendMessage() promise that sendQueuedMessages() creates.
75+
const originalSendMessage = session.sendMessage.bind(session);
76+
let inFlight: Promise<unknown> | undefined;
77+
(session as unknown as { sendMessage: typeof originalSendMessage }).sendMessage = (
78+
...args: Parameters<typeof originalSendMessage>
79+
) => {
80+
const promise = originalSendMessage(...args);
81+
inFlight = promise;
82+
return promise;
83+
};
84+
85+
session.queueMessage("Queued message", { model: "anthropic:claude-sonnet-4-5" });
86+
session.sendQueuedMessages();
87+
88+
expect(inFlight).toBeDefined();
89+
90+
// Dispose while sendMessage() is awaiting appendToHistory.
91+
session.dispose();
92+
appendDeferred.resolve(Ok(undefined));
93+
94+
const result = await (inFlight as Promise<Result<void>>);
95+
expect(result.success).toBe(true);
96+
97+
// We should not attempt to stream once disposal has begun.
98+
expect(streamMessage).toHaveBeenCalledTimes(0);
99+
100+
// Sanity: invoking a forwarded handler after dispose should be a no-op.
101+
const streamStart = aiHandlers.get("stream-start");
102+
expect(() =>
103+
streamStart?.({
104+
type: "stream-start",
105+
workspaceId: "ws",
106+
messageId: "m1",
107+
model: "anthropic:claude-sonnet-4-5",
108+
timestamp: Date.now(),
109+
})
110+
).not.toThrow();
111+
});
112+
});

src/node/services/agentSession.ts

Lines changed: 31 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -453,13 +453,23 @@ export class AgentSession {
453453
return Err(createUnknownSendMessageError(appendResult.error));
454454
}
455455

456+
// Workspace may be tearing down while we await filesystem IO.
457+
// If so, skip event emission + streaming to avoid races with dispose().
458+
if (this.disposed) {
459+
return Ok(undefined);
460+
}
461+
456462
// Add type: "message" for discriminated union (createMuxMessage doesn't add it)
457463
this.emitChatEvent({ ...userMessage, type: "message" });
458464

459465
// If this is a compaction request, terminate background processes first
460466
// They won't be included in the summary, so continuing with orphaned processes would be confusing
461467
if (isCompactionRequestMetadata(typedMuxMetadata)) {
462468
await this.backgroundProcessManager.cleanup(this.workspaceId);
469+
470+
if (this.disposed) {
471+
return Ok(undefined);
472+
}
463473
}
464474

465475
// If this is a compaction request with a continue message, queue it for auto-send after compaction
@@ -501,6 +511,10 @@ export class AgentSession {
501511
this.emitQueuedMessageChanged();
502512
}
503513

514+
if (this.disposed) {
515+
return Ok(undefined);
516+
}
517+
504518
return this.streamWithHistory(options.model, options);
505519
}
506520

@@ -550,6 +564,10 @@ export class AgentSession {
550564
modelString: string,
551565
options?: SendMessageOptions
552566
): Promise<Result<void, SendMessageError>> {
567+
if (this.disposed) {
568+
return Ok(undefined);
569+
}
570+
553571
const commitResult = await this.partialService.commitToHistory(this.workspaceId);
554572
if (!commitResult.success) {
555573
return Err(createUnknownSendMessageError(commitResult.error));
@@ -716,7 +734,12 @@ export class AgentSession {
716734

717735
// Public method to emit chat events (used by init hooks and other workspace events)
718736
emitChatEvent(message: WorkspaceChatMessage): void {
719-
this.assertNotDisposed("emitChatEvent");
737+
// NOTE: Workspace teardown does not await in-flight async work (sendMessage(), stopStream(), etc).
738+
// Those code paths can still try to emit events after dispose; drop them rather than crashing.
739+
if (this.disposed) {
740+
return;
741+
}
742+
720743
this.emitter.emit("chat-event", {
721744
workspaceId: this.workspaceId,
722745
message,
@@ -775,6 +798,13 @@ export class AgentSession {
775798
* Called when tool execution completes, stream ends, or user clicks send immediately.
776799
*/
777800
sendQueuedMessages(): void {
801+
// sendQueuedMessages can race with teardown (e.g. workspace.remove) because we
802+
// trigger it off stream/tool events and disposal does not await stopStream().
803+
// If the session is already disposed, do nothing.
804+
if (this.disposed) {
805+
return;
806+
}
807+
778808
// Clear the queued message flag (even if queue is empty, to handle race conditions)
779809
this.backgroundProcessManager.setMessageQueued(this.workspaceId, false);
780810

src/node/services/streamManager.ts

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -622,17 +622,17 @@ export class StreamManager extends EventEmitter {
622622
abortSignal.addEventListener("abort", () => abortController.abort());
623623
}
624624

625-
// Determine toolChoice based on toolPolicy
625+
// Determine toolChoice based on toolPolicy.
626+
//
626627
// If a tool is required (tools object has exactly one tool after applyToolPolicy),
627-
// force the model to use it with toolChoice: { type: "required", toolName: "..." }
628-
let toolChoice: { type: "required"; toolName: string } | undefined;
628+
// force the model to use it using the AI SDK tool choice shape.
629+
let toolChoice: { type: "tool"; toolName: string } | "required" | undefined;
629630
if (tools && toolPolicy) {
630-
// Check if any filter has "require" action
631631
const hasRequireAction = toolPolicy.some((filter) => filter.action === "require");
632632
if (hasRequireAction && Object.keys(tools).length === 1) {
633633
const requiredToolName = Object.keys(tools)[0];
634-
toolChoice = { type: "required", toolName: requiredToolName };
635-
log.debug("Setting toolChoice to required", { toolName: requiredToolName });
634+
toolChoice = { type: "tool", toolName: requiredToolName };
635+
log.debug("Setting toolChoice to tool", { toolName: requiredToolName });
636636
}
637637
}
638638

Lines changed: 142 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,142 @@
1+
// Minimal MCP server used by integration tests.
2+
//
3+
// Intentionally tiny + dependency-free: it speaks JSON-RPC over stdio
4+
// (newline-delimited JSON) and exposes a single screenshot tool.
5+
//
6+
// This lets us test the MCP → AI SDK image transformation without relying on
7+
// launching a real browser in CI.
8+
9+
const readline = require("readline");
10+
11+
/**
12+
* Write a JSON-RPC message to stdout.
13+
*
14+
* NOTE: @ai-sdk/mcp stdio transport uses newline-delimited JSON.
15+
*/
16+
function send(message) {
17+
process.stdout.write(`${JSON.stringify(message)}\n`);
18+
}
19+
20+
const SERVER_INFO = { name: "mux-test-screenshot-mcp", version: "0.0.0" };
21+
22+
const TOOLS = [
23+
{
24+
name: "take_screenshot",
25+
description: "Return a deterministic screenshot image payload (base64) for tests.",
26+
inputSchema: {
27+
type: "object",
28+
properties: {
29+
format: {
30+
type: "string",
31+
enum: ["png", "jpeg"],
32+
description: "Image format",
33+
},
34+
},
35+
additionalProperties: true,
36+
},
37+
},
38+
];
39+
40+
const rl = readline.createInterface({ input: process.stdin, crlfDelay: Infinity });
41+
42+
rl.on("line", (line) => {
43+
const trimmed = line.trim();
44+
if (trimmed.length === 0) return;
45+
46+
let message;
47+
try {
48+
message = JSON.parse(trimmed);
49+
} catch {
50+
return;
51+
}
52+
53+
if (message?.jsonrpc !== "2.0") return;
54+
55+
// Notifications have no id; ignore.
56+
if (message.id === undefined) {
57+
return;
58+
}
59+
60+
const id = message.id;
61+
62+
try {
63+
switch (message.method) {
64+
case "initialize": {
65+
const protocolVersion = message.params?.protocolVersion ?? "2024-11-05";
66+
send({
67+
jsonrpc: "2.0",
68+
id,
69+
result: {
70+
protocolVersion,
71+
capabilities: { tools: {} },
72+
serverInfo: SERVER_INFO,
73+
},
74+
});
75+
return;
76+
}
77+
78+
case "tools/list": {
79+
send({ jsonrpc: "2.0", id, result: { tools: TOOLS } });
80+
return;
81+
}
82+
83+
case "tools/call": {
84+
const toolName = message.params?.name;
85+
if (toolName !== "take_screenshot") {
86+
send({
87+
jsonrpc: "2.0",
88+
id,
89+
error: { code: -32601, message: `Unknown tool: ${toolName}` },
90+
});
91+
return;
92+
}
93+
94+
const format = message.params?.arguments?.format;
95+
const mimeType = format === "jpeg" ? "image/jpeg" : "image/png";
96+
97+
// Produce a deterministic payload large enough for tests (>1000 chars base64).
98+
const fillByte = mimeType === "image/jpeg" ? 0x22 : 0x11;
99+
const data = Buffer.alloc(2048, fillByte).toString("base64");
100+
101+
send({
102+
jsonrpc: "2.0",
103+
id,
104+
result: {
105+
content: [{ type: "image", data, mimeType }],
106+
},
107+
});
108+
return;
109+
}
110+
111+
default: {
112+
send({
113+
jsonrpc: "2.0",
114+
id,
115+
error: { code: -32601, message: `Method not found: ${message.method}` },
116+
});
117+
return;
118+
}
119+
}
120+
} catch (error) {
121+
send({
122+
jsonrpc: "2.0",
123+
id,
124+
error: {
125+
code: -32603,
126+
message: error instanceof Error ? error.message : String(error),
127+
},
128+
});
129+
}
130+
});
131+
132+
rl.on("close", () => {
133+
process.exit(0);
134+
});
135+
136+
process.on("SIGTERM", () => {
137+
rl.close();
138+
});
139+
140+
process.on("SIGINT", () => {
141+
rl.close();
142+
});

0 commit comments

Comments
 (0)