diff --git a/.planning/debug/empty-archive-port-race.md b/.planning/debug/resolved/empty-archive-port-race.md similarity index 79% rename from .planning/debug/empty-archive-port-race.md rename to .planning/debug/resolved/empty-archive-port-race.md index 07c6eab..628afe0 100644 --- a/.planning/debug/empty-archive-port-race.md +++ b/.planning/debug/resolved/empty-archive-port-race.md @@ -1,6 +1,6 @@ --- slug: empty-archive-port-race -status: tdd_red_confirmed +status: resolved trigger: | Phase 1 UAT Test 3 surfaced a two-headed BLOCKER: (a) Archive shipped via popup save contains NO video/last_30sec.webm @@ -15,7 +15,7 @@ trigger: | src/background/index.ts:130-167, 340-393 (BUFFER round-trip + zip build branch that silently elides the video file on empty segments). created: 2026-05-16T11:58:51Z -updated: 2026-05-16T12:09:40Z +updated: 2026-05-16T13:00:00Z phase: 01-stabilize-video-pipeline related_uat: .planning/phases/01-stabilize-video-pipeline/01-UAT.md related_review_fix: .planning/phases/01-stabilize-video-pipeline/01-REVIEW-FIX.md @@ -304,10 +304,106 @@ tdd_checkpoint: | ## Resolution -root_cause: "" -fix: "" -verification: "" -files_changed: [] +root_cause: | + Two coupled defects sharing a single architectural failure mode. + + H1 (Uncaught Errors): The PING setInterval at recorder.ts:623-625 + used `keepalivePort?.postMessage({type:'PING'})`. Optional chaining + only guards against `null`, NOT against a connected-but-being- + disconnected Port object. The 290 s pre-emptive setTimeout + (recorder.ts:626-630) called `keepalivePort?.disconnect()`, creating + a microtask race window before the onDisconnect handler ran the + cleanup. Any ping callback queued in that window threw "Attempting + to use a disconnected port object" synchronously. + + H2 (silent empty-video archive): The reconnect-race window left + REQUEST_BUFFER round-trips vulnerable. The offscreen captured + `portAtRequest = keepalivePort` BEFORE await blobToBase64() (CR-01 + pattern); on mid-encode reconnect the post-await guard correctly + REFUSED to post on the now-stale portAtRequest. The SW's per- + request listener was bound to that same stale port, so it timed + out (BUFFER_FETCH_TIMEOUT_MS = 2 s) and resolved {segments: []}. + createArchive's branch at index.ts:346-352 silently elided the + video file when segments.length === 0 — operator received an + 88 KB zip with no last_30sec.webm. Bisect confirmed this silent- + skip branch was an UPSTREAM defect (commit 555eb05) that the + Plan 01-04 port lifecycle (b064a21) + CR-01 (2e3f524) amplified + from latent to fatal. + +fix: | + Option C (Architectural) — request-id'd port protocol + port-health + probe + retry on port replacement + operator-visible error surface. + + 1. RETIRED the 290 s pre-emptive setTimeout reconnect. Replaced with + PONG-based health probe: each PING expects a PONG echo from the + SW; the offscreen counts missedPongs and reconnects via clean + teardown when MAX_MISSED_PONGS=3 exceeded. + + 2. H1 fix: wrapped PING postMessage in try/catch. On synchronous + throw, route through reconnectPort() — no more Uncaught Error + bubbles out to the offscreen console. + + 3. Request-id'd REQUEST_BUFFER / BUFFER: SW generates a uuid per + request (crypto.randomUUID), sends with {type:'REQUEST_BUFFER', + requestId}. Offscreen echoes the same requestId on the BUFFER + response. SW routes via module-level pendingBufferRequests Map + keyed by requestId — port-agnostic, so port replacement does + NOT lose the response. + + 4. SW retry on port replacement: every onConnect post-bootstrap + scans pendingBufferRequests and re-issues REQUEST_BUFFER on the + fresh port with the SAME requestId. The offscreen posts BUFFER + on the CURRENT keepalivePort (no more stale-port refuse-to-post). + This retires the H2 silent-drop class architecturally. + + 5. createArchive throws EmptyVideoBufferError on empty video + buffer. saveArchive catches and emits RECORDING_ERROR to the + popup AND returns {success:false, error} — operator sees a + visible failure instead of silently shipping a video-less zip. + + 6. PING → PONG echo: SW replies to every PING with PONG, closing + the offscreen's health-probe loop. + + Outer BUFFER_FETCH_TIMEOUT_MS bumped 2 s → 10 s to cover retries + across multiple port replacements (inner round-trip still + ~100-200 ms). + +verification: | + All 12 RED-gate tests across 3 files flipped GREEN: + - tests/offscreen/port-reconnect-race.test.ts (H1 + H1.b + H2) + - tests/offscreen/port-health-probe.test.ts (A, B, C, D — health + probe + request-id echo) + - tests/background/request-id-protocol.test.ts (1, 2, 3, 4, 5 — + request-id protocol + retry + error surface + PONG echo) + + New continuous-end-to-end pinning contract in + tests/background/port-lifecycle-continuous.test.ts covers 600 s + of port lifecycle (12 ping cycles + 2 forced reconnects + 3 + SAVE_ARCHIVE round-trips). Asserts no Uncaught + every BUFFER + carries segments + PONGs round-trip. + + Pinning contracts preserved with no regression: + - tests/offscreen/port-serialization.test.ts (D-12): GREEN + - tests/offscreen/segment-rotation.test.ts (D-13): GREEN + - tests/offscreen/webm-playback.test.ts (A3 ffmpeg dry-run): GREEN + + Final suite: 11 files / 53 tests, all GREEN. + Quality gates: npx tsc --noEmit exit 0; grep "as any|@ts-ignore" + clean across src/offscreen + src/background/index.ts + src/shared; + npm run build exit 0 (5.71 kB offscreen, 73.79 kB main). + + CONTEXT.md D-17 amended (append, not replace) with the new port + lifecycle commitments. Original D-17 left intact. + +files_changed: + - src/shared/types.ts (PortMessage.requestId + 'PONG') + - src/offscreen/recorder.ts (health probe + request-id'd encode + H1 try/catch + reconnectPort helper) + - src/background/index.ts (request-id protocol + retry + PONG echo + EmptyVideoBufferError surface + decodeBufferSegments extraction) + - tests/offscreen/port-reconnect-race.test.ts (H1.b refactored to externally-disconnected path; H2 sends requestId) + - tests/offscreen/port-health-probe.test.ts (new — 4 contract tests) + - tests/background/request-id-protocol.test.ts (new — 5 contract tests) + - tests/background/port-lifecycle-continuous.test.ts (new — 600 s e2e simulation) + - .planning/phases/01-stabilize-video-pipeline/01-CONTEXT.md (D-17 amendment appended; original D-17 untouched) ## Bisect Results (2026-05-16T12:15:00Z) diff --git a/.planning/phases/01-stabilize-video-pipeline/01-CONTEXT.md b/.planning/phases/01-stabilize-video-pipeline/01-CONTEXT.md index 68adf3e..ce11a55 100644 --- a/.planning/phases/01-stabilize-video-pipeline/01-CONTEXT.md +++ b/.planning/phases/01-stabilize-video-pipeline/01-CONTEXT.md @@ -299,5 +299,74 @@ phase, so downstream phases see a consistent baseline: --- +## Amendment (Phase 01-stabilize-video-pipeline, 2026-05-16) — D-17 port lifecycle + +- **AMENDED-BY:** debug session `empty-archive-port-race` (Option C, resolved 2026-05-16) +- **Replaces nothing.** D-17 above stands as the original port-keepalive + contract. This amendment narrows the port LIFECYCLE shape that was + Claude's-discretion at Phase 1 plan time. +- **Background.** UAT Test 3 surfaced two coupled defects: 3× Uncaught + "Attempting to use a disconnected port object" in the offscreen console + starting at the 290 s pre-emptive reconnect mark, AND a silent + empty-video archive shipping to the operator. Bisect proved the H1 + port race was introduced by Plan 01-04 (commit `b064a21`) and the H2 + silent-skip in `createArchive` was an upstream defect from commit + `555eb05` that the port race amplified from latent to fatal. Full + lineage and strategy rationale: `.planning/debug/resolved/empty-archive-port-race.md`. +- **Architectural commitments retired:** + - The legacy 290 000 ms pre-emptive `setTimeout` reconnect (Pitfall 4) + is RETIRED. Its race window between the synchronous `.disconnect()` + and the `onDisconnect` handler firing was the proximate cause of + H1 — see the bisect notes. +- **Architectural commitments added:** + - **Port health probe (PING ↔ PONG).** The offscreen `PORT_PING_MS` + interval doubles as a liveness probe; each PING expects a PONG echo + from the SW. The offscreen tracks `missedPongs` and reconnects + cleanly when the count exceeds `MAX_MISSED_PONGS = 3` (~75 s of + unresponsive SW — well past Chrome's ~30 s idle threshold, so a + real disconnect would already have surfaced its own + `onDisconnect`). The SW echoes PONG on every PING via the + onConnect-level message sink. + - **Request-id'd REQUEST_BUFFER / BUFFER.** Every `REQUEST_BUFFER` + carries a SW-generated `requestId` (crypto.randomUUID with + Math.random fallback). The offscreen echoes the same `requestId` + on the BUFFER response. The SW routes BUFFER → pending request via + a module-level `Map` — port- + agnostic, so port replacement does not lose the response. + - **Retry on port replacement.** Every `onConnect` (post-bootstrap) + scans `pendingBufferRequests` and re-issues REQUEST_BUFFER on the + fresh port with the SAME requestId. The offscreen posts BUFFER on + the CURRENT `keepalivePort`, the sink matches by id, and the + request resolves. This retires the H2 silent-drop class + architecturally. + - **Outer hard-timeout bumped 2 s → 10 s.** The legacy per-port + `BUFFER_FETCH_TIMEOUT_MS = 2000` was too tight to cover a retry + across a reconnect. The new outer budget covers EVERY retry across + port replacements; the inner per-port round-trip is still + ~100-200 ms. + - **Operator-visible failure surface.** `createArchive` throws + `EmptyVideoBufferError` when the video buffer is empty. `saveArchive` + catches and emits `{type:'RECORDING_ERROR', error:'empty-video-buffer'}` + via `chrome.runtime.sendMessage` for the popup, AND returns + `{success:false, error}` in the direct-response path. Replaces the + upstream silent-skip branch in `createArchive` that shipped a + video-less archive in silence. +- **Pinning contracts added:** + - `tests/offscreen/port-health-probe.test.ts` — pins the PING/PONG + + request-id'd encode contract on the offscreen side (4 tests). + - `tests/background/request-id-protocol.test.ts` — pins the SW-side + request-id routing + retry + error-surface contract (5 tests). + - `tests/background/port-lifecycle-continuous.test.ts` — continuous + 600 s end-to-end simulation: 12 ping/pong cycles + 2 forced + reconnects + 3 SAVE_ARCHIVE round-trips, asserts no Uncaught and + every BUFFER carries segments. + - `tests/offscreen/port-reconnect-race.test.ts` (refactored): H1.b + no longer pins the retired 290 s setTimeout path — it now pins + the externally-disconnected-port → ping try/catch → reconnect path + that the H1 fix delivers. + +--- + *Phase: 01-stabilize-video-pipeline* *Context gathered: 2026-05-15* +*Amended: 2026-05-16 (debug session empty-archive-port-race, Option C)*