feat(option-c-offscreen): port health probe + request-id'd BUFFER + H1 try/catch

Implements the offscreen-side architectural refactor per
.planning/debug/empty-archive-port-race.md "Fix Strategy: Option C":

1. **Retired** the 290_000 ms pre-emptive reconnect setTimeout. Its race
   window between the synchronous .disconnect() and the onDisconnect
   handler running was the bisect-confirmed proximate cause of the H1
   "Attempting to use a disconnected port object" Uncaught Errors.

2. **Added** PONG-based health probe: each ping increments missedPongs;
   if MAX_MISSED_PONGS (3) consecutive PINGs go without echo, reconnect
   via the same clean teardown path the onDisconnect handler uses.
   PONG receipt resets the counter. Liveness-based replacement for the
   time-based pre-emptive rotation.

3. **H1 fix** — wrap PING postMessage in try/catch. The port object can
   transition to disconnected synchronously (SW eviction, port glitch)
   between the interval-callback being queued and it running. The catch
   absorbs the throw and routes through reconnectPort() — no more
   uncaught throws bubble out to the offscreen console.

4. **Request-id'd protocol** — REQUEST_BUFFER carries the SW-generated
   requestId; BUFFER response echoes it. The offscreen now posts on the
   CURRENT keepalivePort (no more portAtRequest stale-port refuse-to-
   post). The SW matches BUFFER → request by id, so port replacement
   mid-encode no longer drops the response — the SW retries on the new
   port and the matching BUFFER routes correctly.

5. **reconnectPort(reason)** — new helper consolidating the
   teardown+disconnect+reconnect dance used by both the missed-PONG
   path and the synchronous-throw path. Idempotent w.r.t. the chained
   onDisconnect callback.

Test updates:
  - H2 now sends REQUEST_BUFFER with a requestId (Option C contract).
  - H1.b refactored to test the externally-disconnected path (since the
    pre-emptive timeout path is gone): port._disconnected=true, fire
    ping, assert no throw + a fresh port appears.
  - Top-level snapshots of timer globals + afterEach restoration so a
    failing test doesn't leak overridden globals into the next test.

Status: 48 GREEN, 4 RED (the remaining RED is all SW-side — addressed
in next commit). All H1 + H1.b + H2 contracts now GREEN. Pinning
contracts (D-12 port-serialization, D-13 segment-rotation, A3 webm-
playback) untouched. tsc --noEmit exit 0; type-safety grep clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-16 14:37:22 +02:00
parent c6e8101860
commit 1fb3e978cb
2 changed files with 181 additions and 106 deletions

View File

@@ -36,7 +36,15 @@ const VIDEO_BITRATE = 400_000; // CON-video-codec
// контракт ротации.
const PORT_NAME = 'video-keepalive'; // long-lived port name (D-17, Pattern 5)
const PORT_PING_MS = 25_000; // < 30 s SW idle threshold
const PORT_RECONNECT_MS = 290_000; // pre-empt the ~5 min port cap (Pitfall 4)
// Option C health-probe: how many consecutive PINGs may go unanswered
// (no PONG echo) before we declare the port dead and reconnect. With
// PORT_PING_MS = 25_000 ms, 3 missed PONGs = 75 s of unresponsive SW —
// well past the ~30 s Chrome SW idle threshold, so a real disconnect
// already would have fired its onDisconnect handler by then. This
// threshold catches the failure mode the old 290 s pre-emptive setTimeout
// pretended to: the SW is up but the port has gone half-open.
// See .planning/debug/empty-archive-port-race.md (Fix Strategy: Option C).
const MAX_MISSED_PONGS = 3;
const logger = new OffscreenLogger('Recorder');
@@ -466,29 +474,56 @@ function isFromOwnExtension(sender: chrome.runtime.MessageSender | undefined): b
return sender?.id === chrome.runtime.id;
}
// Stable handles for the ping interval and the pre-emptive reconnect timer,
// so we can clear them on disconnect / re-init.
// Stable handle for the ping interval. The legacy 290 s pre-emptive
// reconnect setTimeout was retired in the Option C refactor — its race
// window between the synchronous .disconnect() and the onDisconnect
// handler running was the proximate cause of the H1 "Attempting to use a
// disconnected port object" Uncaught Errors per the bisect notes in
// .planning/debug/empty-archive-port-race.md.
let pingIntervalId: ReturnType<typeof setInterval> | null = null;
let preemptiveReconnectId: ReturnType<typeof setTimeout> | null = null;
// Option C health-probe state. The offscreen pings every PORT_PING_MS and
// expects a PONG echo from the SW. If MAX_MISSED_PONGS consecutive PINGs
// go unanswered, the port is considered dead and a clean reconnect is
// initiated via the same teardown path the onDisconnect handler uses.
// This replaces the time-based pre-emptive reconnect with a liveness-
// based one: the port lives until it stops responding, not until a wall
// clock says it should rotate.
let missedPongs = 0;
function teardownPortTimers(): void {
if (pingIntervalId !== null) {
clearInterval(pingIntervalId);
pingIntervalId = null;
}
if (preemptiveReconnectId !== null) {
clearTimeout(preemptiveReconnectId);
preemptiveReconnectId = null;
}
}
function onPortMessage(message: unknown): void {
// Defense-in-depth: explicit shape check before destructuring (T-1-04)
// Defense-in-depth: explicit shape check before destructuring (T-1-04).
if (typeof message !== 'object' || message === null) {
return;
}
const type = (message as { type?: unknown }).type;
if (type === 'PONG') {
// Health-probe echo. Resetting missedPongs to 0 closes the probe loop
// — the port is demonstrably alive.
missedPongs = 0;
return;
}
if (type === 'REQUEST_BUFFER') {
// Option C: REQUEST_BUFFER carries a requestId the SW generated per
// call. We echo it on the BUFFER response so the SW's per-request
// listener can match by id (preventing cross-talk between concurrent
// requests and between retries on port replacement).
const requestId = (message as { requestId?: unknown }).requestId;
if (typeof requestId !== 'string' || requestId.length === 0) {
// Defense-in-depth: a REQUEST_BUFFER without a valid id is invalid
// under the Option C protocol — drop with a warn. Legacy callers
// are absent from the codebase; this is purely a guard against
// future regressions or hostile peers.
logger.warn('REQUEST_BUFFER without a valid requestId — dropping');
return;
}
// Fire-and-forget: the chrome.runtime.Port.onMessage listener API
// ignores the return value. We do the async base64 encoding inside
// an IIFE so the listener stays synchronous-typed (void return).
@@ -497,7 +532,7 @@ function onPortMessage(message: unknown): void {
// because chrome.runtime.Port JSON-serializes across extension contexts,
// and JSON.stringify(blob) === "{}". See src/shared/binary.ts and
// tests/offscreen/port-serialization.test.ts for the contract.
void encodeAndSendBuffer();
void encodeAndSendBuffer(requestId);
}
// Any unknown port message type is silently dropped (T-1-04 defense-in-depth).
}
@@ -519,32 +554,24 @@ function onPortMessage(message: unknown): void {
// post-completion state.
let encodeInFlight = false;
async function encodeAndSendBuffer(): Promise<void> {
async function encodeAndSendBuffer(requestId: string): Promise<void> {
if (encodeInFlight) {
logger.warn('encodeAndSendBuffer already running — dropping concurrent call');
return;
}
// CR-01 fix: capture the port identity BEFORE the await. If `keepalivePort`
// is replaced by a fresh reconnect during base64 encoding, posting on the
// new port would silently leak the BUFFER to a stranger — the SW's
// per-request listener is still bound to the OLD port. The SW already
// times out cleanly (BUFFER_FETCH_TIMEOUT_MS = 2 s), so dropping a stale
// response on the floor is the correct behaviour: the next SAVE_ARCHIVE
// round-trip will REQUEST_BUFFER on the fresh port.
const portAtRequest = keepalivePort;
if (portAtRequest === null) {
if (keepalivePort === null) {
logger.warn('encodeAndSendBuffer called without an active port — drop');
return;
}
encodeInFlight = true;
try {
await doEncodeAndSendBuffer(portAtRequest);
await doEncodeAndSendBuffer(requestId);
} finally {
encodeInFlight = false;
}
}
async function doEncodeAndSendBuffer(portAtRequest: chrome.runtime.Port): Promise<void> {
async function doEncodeAndSendBuffer(requestId: string): Promise<void> {
// WR-09 fix: an in-flight segment lacks the Matroska finalization that
// MediaRecorder.stop() performs (SegmentSize, Cues) — splicing it onto
// a finalized tail re-introduces the "File ended prematurely" symptom
@@ -589,18 +616,62 @@ async function doEncodeAndSendBuffer(portAtRequest: chrome.runtime.Port): Promis
);
}
}
// After the await: refuse to post on a port that has been REPLACED
// by reconnect. The SW listens on the OLD port; posting on the NEW
// port would silently lose the data. Letting the SW time out is
// correct — the next SAVE_ARCHIVE will re-issue REQUEST_BUFFER on
// the fresh port.
if (keepalivePort !== portAtRequest) {
// After the await: post the BUFFER on the CURRENT keepalivePort. Under
// the Option C protocol the BUFFER carries the requestId the SW issued,
// so even if the port was replaced mid-encode (and the SW's matching
// listener migrated to the new port via the SW-side retry — see
// src/background/index.ts:getVideoBufferFromOffscreen), the BUFFER
// routes correctly by id. This retires the H2 silent-drop class that
// the bisect (.planning/debug/empty-archive-port-race.md) traced to
// the CR-01 "refuse-to-post on swapped port" pattern.
//
// Edge case: if the encode took so long that keepalivePort is now null
// (e.g. mid-reconnect with no fresh port yet) we drop with a warn.
// The SW's outer timeout will eventually fire and surface the error to
// the operator (no more silent video-less archive).
if (keepalivePort === null) {
logger.warn(
'port replaced during encode; dropping BUFFER response (SW will time out and retry)',
'port unavailable at post time; dropping BUFFER (SW outer timeout will surface)',
);
return;
}
portAtRequest.postMessage({ type: 'BUFFER', segments: transferred });
try {
keepalivePort.postMessage({
type: 'BUFFER',
requestId,
segments: transferred,
});
} catch (err) {
// The current port may have been disconnected between the null-check
// above and postMessage. Treat as a dropped response — the SW retries
// on the fresh port.
logger.warn('BUFFER post failed (port disconnected mid-post):', err);
}
}
// Triggers a clean reconnect: tear down timers, disconnect the current
// port (if any), null the reference, and re-enter connectPort(). The
// disconnect call fires the onDisconnect handler on the local side too
// (Chrome's documented behaviour for explicit .disconnect()), and that
// handler also calls connectPort() — but the missedPongs reset + the
// teardownPortTimers idempotency keep the second pass a no-op.
function reconnectPort(reason: string): void {
logger.warn('Initiating port reconnect:', reason);
teardownPortTimers();
const stale = keepalivePort;
keepalivePort = null;
missedPongs = 0;
if (stale !== null) {
try {
stale.disconnect();
} catch (err) {
// Disconnecting an already-disconnected port throws on some Chrome
// versions; silently ignored — the local-side reference has already
// been cleared so no further use is possible.
logger.warn('stale.disconnect() during reconnect failed:', err);
}
}
connectPort();
}
function connectPort(): void {
@@ -612,22 +683,47 @@ function connectPort(): void {
keepalivePort = null;
return;
}
missedPongs = 0;
keepalivePort.onMessage.addListener(onPortMessage);
keepalivePort.onDisconnect.addListener(() => {
logger.warn('port disconnected — reconnecting');
teardownPortTimers();
keepalivePort = null;
// Synchronous reconnect — tests/offscreen/port.test.ts pins this
missedPongs = 0;
// Synchronous reconnect — tests/offscreen/port.test.ts pins this.
connectPort();
});
pingIntervalId = setInterval(() => {
keepalivePort?.postMessage({ type: 'PING' });
// H1 fix: wrap postMessage in try/catch. The port object can become
// disconnected between the interval-callback being queued and it
// running (race window between local .disconnect() and onDisconnect
// firing). Chrome's documented runtime semantics make postMessage
// throw synchronously with "Attempting to use a disconnected port
// object". Without the catch, that became an Uncaught Error in the
// offscreen console (3x at 290 s in UAT Test 3).
//
// Health probe: each ping costs us one missed-PONG slot. The SW
// echoes PONG on PING (src/background/index.ts), and onPortMessage
// resets missedPongs to 0 on receipt. If MAX_MISSED_PONGS pings go
// without a PONG, the SW (or the port) is unresponsive and we
// reconnect cleanly.
if (keepalivePort === null) {
return;
}
missedPongs++;
if (missedPongs > MAX_MISSED_PONGS) {
reconnectPort(`missed ${missedPongs} PONGs — port unresponsive`);
return;
}
try {
keepalivePort.postMessage({ type: 'PING' });
} catch (err) {
// The port went disconnected synchronously — initiate the same
// clean reconnect path. The throw caught here is exactly the H1
// "Attempting to use a disconnected port object" surface.
reconnectPort(`PING postMessage threw: ${String(err)}`);
}
}, PORT_PING_MS);
preemptiveReconnectId = setTimeout(() => {
logger.log('pre-emptive port reconnect (290 s cap)');
keepalivePort?.disconnect();
// onDisconnect handler above triggers a fresh connectPort() call
}, PORT_RECONNECT_MS);
}
// Бутстрап (handshake + port lifecycle).