feat(option-c-offscreen): port health probe + request-id'd BUFFER + H1 try/catch

Implements the offscreen-side architectural refactor per
.planning/debug/empty-archive-port-race.md "Fix Strategy: Option C":

1. **Retired** the 290_000 ms pre-emptive reconnect setTimeout. Its race
   window between the synchronous .disconnect() and the onDisconnect
   handler running was the bisect-confirmed proximate cause of the H1
   "Attempting to use a disconnected port object" Uncaught Errors.

2. **Added** PONG-based health probe: each ping increments missedPongs;
   if MAX_MISSED_PONGS (3) consecutive PINGs go without echo, reconnect
   via the same clean teardown path the onDisconnect handler uses.
   PONG receipt resets the counter. Liveness-based replacement for the
   time-based pre-emptive rotation.

3. **H1 fix** — wrap PING postMessage in try/catch. The port object can
   transition to disconnected synchronously (SW eviction, port glitch)
   between the interval-callback being queued and it running. The catch
   absorbs the throw and routes through reconnectPort() — no more
   uncaught throws bubble out to the offscreen console.

4. **Request-id'd protocol** — REQUEST_BUFFER carries the SW-generated
   requestId; BUFFER response echoes it. The offscreen now posts on the
   CURRENT keepalivePort (no more portAtRequest stale-port refuse-to-
   post). The SW matches BUFFER → request by id, so port replacement
   mid-encode no longer drops the response — the SW retries on the new
   port and the matching BUFFER routes correctly.

5. **reconnectPort(reason)** — new helper consolidating the
   teardown+disconnect+reconnect dance used by both the missed-PONG
   path and the synchronous-throw path. Idempotent w.r.t. the chained
   onDisconnect callback.

Test updates:
  - H2 now sends REQUEST_BUFFER with a requestId (Option C contract).
  - H1.b refactored to test the externally-disconnected path (since the
    pre-emptive timeout path is gone): port._disconnected=true, fire
    ping, assert no throw + a fresh port appears.
  - Top-level snapshots of timer globals + afterEach restoration so a
    failing test doesn't leak overridden globals into the next test.

Status: 48 GREEN, 4 RED (the remaining RED is all SW-side — addressed
in next commit). All H1 + H1.b + H2 contracts now GREEN. Pinning
contracts (D-12 port-serialization, D-13 segment-rotation, A3 webm-
playback) untouched. tsc --noEmit exit 0; type-safety grep clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-16 14:37:22 +02:00
parent c6e8101860
commit 1fb3e978cb
2 changed files with 181 additions and 106 deletions

View File

@@ -25,7 +25,7 @@
// tests/offscreen/port.test.ts + handshake.test.ts). No source-code
// re-implementation — the production module is what we're testing.
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
interface PortStub {
name: string;
@@ -122,21 +122,27 @@ function buildChromeStub(ports: PortStub[]): ChromeStub {
};
}
describe('port reconnect race (RED — confirms empty-archive-port-race H1+H2)', () => {
let originalSetInterval: typeof globalThis.setInterval;
let originalClearInterval: typeof globalThis.clearInterval;
let originalSetTimeout: typeof globalThis.setTimeout;
let originalClearTimeout: typeof globalThis.clearTimeout;
// Top-level snapshots of the real timer APIs — captured at module load,
// BEFORE any test runs, so they survive a test that fails before its
// inline restore code. The afterEach below restores unconditionally.
const ORIGINAL_SET_INTERVAL = globalThis.setInterval;
const ORIGINAL_CLEAR_INTERVAL = globalThis.clearInterval;
const ORIGINAL_SET_TIMEOUT = globalThis.setTimeout;
const ORIGINAL_CLEAR_TIMEOUT = globalThis.clearTimeout;
describe('port reconnect race (RED — confirms empty-archive-port-race H1+H2)', () => {
beforeEach(() => {
vi.resetModules();
(globalThis as unknown as GlobalWithChrome).MediaRecorder = {
isTypeSupported: vi.fn().mockReturnValue(true),
};
originalSetInterval = globalThis.setInterval;
originalClearInterval = globalThis.clearInterval;
originalSetTimeout = globalThis.setTimeout;
originalClearTimeout = globalThis.clearTimeout;
});
afterEach(() => {
globalThis.setInterval = ORIGINAL_SET_INTERVAL;
globalThis.clearInterval = ORIGINAL_CLEAR_INTERVAL;
globalThis.setTimeout = ORIGINAL_SET_TIMEOUT;
globalThis.clearTimeout = ORIGINAL_CLEAR_TIMEOUT;
});
// ──────────────────────────────────────────────────────────────────────
@@ -199,9 +205,6 @@ describe('port reconnect race (RED — confirms empty-archive-port-race H1+H2)',
// operator observed in UAT Test 3.
expect(() => pingCb()).not.toThrow();
// Restore globals.
globalThis.setInterval = originalSetInterval;
globalThis.clearInterval = originalClearInterval;
});
// ──────────────────────────────────────────────────────────────────────
@@ -272,7 +275,16 @@ describe('port reconnect race (RED — confirms empty-archive-port-race H1+H2)',
// 2) SW issues REQUEST_BUFFER — synthetically dispatch to the
// captured onPortMessage listener. The production code kicks off
// encodeAndSendBuffer which awaits 3 blobToBase64 calls.
oldPort.onMessage._listeners[0]({ type: 'REQUEST_BUFFER' });
//
// Under the Option C protocol the request carries a requestId the
// SW generates per call; the offscreen echoes it on the BUFFER
// response. This test uses a literal id since it's not driving
// the SW — only the recorder's onPortMessage path.
const requestId = 'h2-test-id';
oldPort.onMessage._listeners[0]({
type: 'REQUEST_BUFFER',
requestId,
});
// 3) Mid-encode, the pre-emptive reconnect fires. Simulate by
// firing onDisconnect on the old port — production code's
@@ -346,82 +358,49 @@ describe('port reconnect race (RED — confirms empty-archive-port-race H1+H2)',
});
// ──────────────────────────────────────────────────────────────────────
// H1.b — pre-emptive reconnect path specifically. Walk the same
// sequence the operator hit at 290 s: the production code calls
// `keepalivePort?.disconnect()` and the test then invokes the ping
// callback that was installed BEFORE the reconnect. This proves the
// race window in the production timer-clear ordering.
// H1.b — externally-disconnected port path. The original 290 s
// pre-emptive setTimeout is gone (Option C retired it — see the
// "Bisect Results" section of empty-archive-port-race.md). What
// remains is the same H1 race shape, just triggered by a different
// path: the SW side (or Chrome's idle eviction, or a network glitch)
// can mark the remote end of the port as disconnected; the local
// postMessage then throws synchronously with the same Chrome error
// string. The fix's try/catch + reconnect path must absorb that.
// ──────────────────────────────────────────────────────────────────────
it('H1.b: pre-emptive reconnect path — ping does not throw against just-disconnected port', async () => {
it('H1.b: externally-disconnected port — ping path absorbs the throw and reconnects cleanly', async () => {
const ports: PortStub[] = [];
const stub = buildChromeStub(ports);
(globalThis as unknown as GlobalWithChrome).chrome = stub;
// Capture both interval callbacks AND timeout callbacks for direct
// invocation. We need to fire the pre-emptive setTimeout
// synthetically (it would otherwise wait 290 s of wall-clock time).
const intervalCallbacks: Array<() => void> = [];
const timeoutCallbacks: Array<() => void> = [];
globalThis.setInterval = ((cb: () => void, _ms: number) => {
intervalCallbacks.push(cb);
return 100 as unknown as ReturnType<typeof setInterval>;
}) as typeof globalThis.setInterval;
globalThis.setTimeout = ((cb: () => void, _ms: number) => {
timeoutCallbacks.push(cb);
return 200 as unknown as ReturnType<typeof setTimeout>;
}) as typeof globalThis.setTimeout;
// No-op the clear* so the captured callbacks remain callable —
// surfaces the race-window assumption in the production code.
globalThis.clearInterval = (() => {}) as typeof globalThis.clearInterval;
globalThis.clearTimeout = (() => {}) as typeof globalThis.clearTimeout;
await import('../../src/offscreen/recorder');
expect(ports.length).toBe(1);
const port = ports[0];
// intervalCallbacks[0] is the ping callback (first setInterval call).
// timeoutCallbacks contains at least the pre-emptive reconnect timer
// (recorder.ts:626). Other setTimeouts may exist (e.g. queueMicrotask
// shims, scheduleRotation — but the latter is gated on mediaStream
// which is null in this test).
expect(intervalCallbacks.length).toBeGreaterThanOrEqual(1);
expect(timeoutCallbacks.length).toBeGreaterThanOrEqual(1);
const oldPort = ports[0];
const pingCb = intervalCallbacks[0];
// The pre-emptive reconnect timeout is the only setTimeout the
// bootstrap installs (rotation is gated on mediaStream). Pick the
// first to keep the test resilient.
const preemptiveReconnect = timeoutCallbacks[0];
// 1) Fire pre-emptive reconnect the production code calls
// keepalivePort?.disconnect(). Our stub flips _disconnected=true
// synchronously (matching Chrome's local disconnect semantics).
preemptiveReconnect();
expect(port._disconnected).toBe(true);
expect(port.disconnect).toHaveBeenCalled();
// Simulate an EXTERNAL disconnect: the remote end of the port has
// been torn down (SW eviction, port glitch, network blip), but the
// local onDisconnect handler hasn't fired yet — this is exactly the
// half-open state that the original H1 race weaponised.
oldPort._disconnected = true;
// 2) Now the race: an already-queued ping callback (e.g. scheduled
// before the reconnect) fires against the just-disconnected port.
// Production code: keepalivePort?.postMessage({type:'PING'}) —
// keepalivePort is still the old non-null reference at this point
// (the onDisconnect handler may not have run yet in the same
// macrotask), so postMessage runs against the disconnected port
// and our stub throws — exactly matching the UAT error message.
//
// The fix must make this safe. Acceptable strategies:
// - Clear the interval as the FIRST step of the pre-emptive
// reconnect, before .disconnect().
// - Wrap postMessage in try/catch inside the ping callback.
// - Switch to a per-port closure (capturedPort) so a stale ping
// can detect it's pinging a swapped port.
// Fire the ping. Under the legacy code this threw with "Attempting
// to use a disconnected port object"; under the Option C fix the
// ping callback wraps postMessage in try/catch and initiates a
// clean reconnect via reconnectPort(). The test asserts BOTH that
// no Uncaught Error escapes AND that a fresh port has been
// requested (the visible side-effect of the reconnect path).
expect(() => pingCb()).not.toThrow();
expect(ports.length).toBeGreaterThanOrEqual(2);
// Restore globals.
globalThis.setInterval = originalSetInterval;
globalThis.clearInterval = originalClearInterval;
globalThis.setTimeout = originalSetTimeout;
globalThis.clearTimeout = originalClearTimeout;
});
});