docs(option-c): archive empty-archive-port-race + amend CONTEXT.md D-17 port lifecycle

Two doc updates closing the debug session per the resolved pattern this
phase has established (cf. resolved/d12-blob-port-transfer-fails.md and
resolved/webm-playback-freeze.md):

1. **Move debug session to resolved/** with the Resolution section
   filled in (root_cause, fix, verification, files_changed). Status
   flipped tdd_red_confirmed -> resolved. Original investigation
   notes + bisect results + Option C strategy spec all preserved
   in-place — the file is the full provenance trail.

2. **Amend 01-CONTEXT.md D-17** with the new port lifecycle commitments.
   Append-only (D-17 itself untouched) per the doc cascade rule
   established earlier this phase ("amendments append, do not replace,
   to preserve SPEC provenance"). The amendment narrates:
   - What was Claude's-discretion at Phase 1 plan time has been
     specified by Option C.
   - The 290 s pre-emptive setTimeout reconnect (Pitfall 4) is RETIRED.
   - The architectural commitments added: PING/PONG health probe,
     request-id'd REQUEST_BUFFER/BUFFER, SW retry on port replacement,
     outer 10 s hard-timeout, operator-visible EmptyVideoBufferError
     surface.
   - The 4 pinning contracts added (port-health-probe,
     request-id-protocol, port-lifecycle-continuous, plus the
     refactored port-reconnect-race).

Suite remains 11 files / 53 tests, all GREEN. Quality gates intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-16 15:40:02 +02:00
parent 246eadb2ef
commit f0871c0237
2 changed files with 171 additions and 6 deletions

View File

@@ -1,6 +1,6 @@
--- ---
slug: empty-archive-port-race slug: empty-archive-port-race
status: tdd_red_confirmed status: resolved
trigger: | trigger: |
Phase 1 UAT Test 3 surfaced a two-headed BLOCKER: Phase 1 UAT Test 3 surfaced a two-headed BLOCKER:
(a) Archive shipped via popup save contains NO video/last_30sec.webm (a) Archive shipped via popup save contains NO video/last_30sec.webm
@@ -15,7 +15,7 @@ trigger: |
src/background/index.ts:130-167, 340-393 (BUFFER round-trip + zip src/background/index.ts:130-167, 340-393 (BUFFER round-trip + zip
build branch that silently elides the video file on empty segments). build branch that silently elides the video file on empty segments).
created: 2026-05-16T11:58:51Z created: 2026-05-16T11:58:51Z
updated: 2026-05-16T12:09:40Z updated: 2026-05-16T13:00:00Z
phase: 01-stabilize-video-pipeline phase: 01-stabilize-video-pipeline
related_uat: .planning/phases/01-stabilize-video-pipeline/01-UAT.md related_uat: .planning/phases/01-stabilize-video-pipeline/01-UAT.md
related_review_fix: .planning/phases/01-stabilize-video-pipeline/01-REVIEW-FIX.md related_review_fix: .planning/phases/01-stabilize-video-pipeline/01-REVIEW-FIX.md
@@ -304,10 +304,106 @@ tdd_checkpoint: |
## Resolution ## Resolution
root_cause: "" root_cause: |
fix: "" Two coupled defects sharing a single architectural failure mode.
verification: ""
files_changed: [] H1 (Uncaught Errors): The PING setInterval at recorder.ts:623-625
used `keepalivePort?.postMessage({type:'PING'})`. Optional chaining
only guards against `null`, NOT against a connected-but-being-
disconnected Port object. The 290 s pre-emptive setTimeout
(recorder.ts:626-630) called `keepalivePort?.disconnect()`, creating
a microtask race window before the onDisconnect handler ran the
cleanup. Any ping callback queued in that window threw "Attempting
to use a disconnected port object" synchronously.
H2 (silent empty-video archive): The reconnect-race window left
REQUEST_BUFFER round-trips vulnerable. The offscreen captured
`portAtRequest = keepalivePort` BEFORE await blobToBase64() (CR-01
pattern); on mid-encode reconnect the post-await guard correctly
REFUSED to post on the now-stale portAtRequest. The SW's per-
request listener was bound to that same stale port, so it timed
out (BUFFER_FETCH_TIMEOUT_MS = 2 s) and resolved {segments: []}.
createArchive's branch at index.ts:346-352 silently elided the
video file when segments.length === 0 — operator received an
88 KB zip with no last_30sec.webm. Bisect confirmed this silent-
skip branch was an UPSTREAM defect (commit 555eb05) that the
Plan 01-04 port lifecycle (b064a21) + CR-01 (2e3f524) amplified
from latent to fatal.
fix: |
Option C (Architectural) — request-id'd port protocol + port-health
probe + retry on port replacement + operator-visible error surface.
1. RETIRED the 290 s pre-emptive setTimeout reconnect. Replaced with
PONG-based health probe: each PING expects a PONG echo from the
SW; the offscreen counts missedPongs and reconnects via clean
teardown when MAX_MISSED_PONGS=3 exceeded.
2. H1 fix: wrapped PING postMessage in try/catch. On synchronous
throw, route through reconnectPort() — no more Uncaught Error
bubbles out to the offscreen console.
3. Request-id'd REQUEST_BUFFER / BUFFER: SW generates a uuid per
request (crypto.randomUUID), sends with {type:'REQUEST_BUFFER',
requestId}. Offscreen echoes the same requestId on the BUFFER
response. SW routes via module-level pendingBufferRequests Map
keyed by requestId — port-agnostic, so port replacement does
NOT lose the response.
4. SW retry on port replacement: every onConnect post-bootstrap
scans pendingBufferRequests and re-issues REQUEST_BUFFER on the
fresh port with the SAME requestId. The offscreen posts BUFFER
on the CURRENT keepalivePort (no more stale-port refuse-to-post).
This retires the H2 silent-drop class architecturally.
5. createArchive throws EmptyVideoBufferError on empty video
buffer. saveArchive catches and emits RECORDING_ERROR to the
popup AND returns {success:false, error} — operator sees a
visible failure instead of silently shipping a video-less zip.
6. PING → PONG echo: SW replies to every PING with PONG, closing
the offscreen's health-probe loop.
Outer BUFFER_FETCH_TIMEOUT_MS bumped 2 s → 10 s to cover retries
across multiple port replacements (inner round-trip still
~100-200 ms).
verification: |
All 12 RED-gate tests across 3 files flipped GREEN:
- tests/offscreen/port-reconnect-race.test.ts (H1 + H1.b + H2)
- tests/offscreen/port-health-probe.test.ts (A, B, C, D — health
probe + request-id echo)
- tests/background/request-id-protocol.test.ts (1, 2, 3, 4, 5 —
request-id protocol + retry + error surface + PONG echo)
New continuous-end-to-end pinning contract in
tests/background/port-lifecycle-continuous.test.ts covers 600 s
of port lifecycle (12 ping cycles + 2 forced reconnects + 3
SAVE_ARCHIVE round-trips). Asserts no Uncaught + every BUFFER
carries segments + PONGs round-trip.
Pinning contracts preserved with no regression:
- tests/offscreen/port-serialization.test.ts (D-12): GREEN
- tests/offscreen/segment-rotation.test.ts (D-13): GREEN
- tests/offscreen/webm-playback.test.ts (A3 ffmpeg dry-run): GREEN
Final suite: 11 files / 53 tests, all GREEN.
Quality gates: npx tsc --noEmit exit 0; grep "as any|@ts-ignore"
clean across src/offscreen + src/background/index.ts + src/shared;
npm run build exit 0 (5.71 kB offscreen, 73.79 kB main).
CONTEXT.md D-17 amended (append, not replace) with the new port
lifecycle commitments. Original D-17 left intact.
files_changed:
- src/shared/types.ts (PortMessage.requestId + 'PONG')
- src/offscreen/recorder.ts (health probe + request-id'd encode + H1 try/catch + reconnectPort helper)
- src/background/index.ts (request-id protocol + retry + PONG echo + EmptyVideoBufferError surface + decodeBufferSegments extraction)
- tests/offscreen/port-reconnect-race.test.ts (H1.b refactored to externally-disconnected path; H2 sends requestId)
- tests/offscreen/port-health-probe.test.ts (new — 4 contract tests)
- tests/background/request-id-protocol.test.ts (new — 5 contract tests)
- tests/background/port-lifecycle-continuous.test.ts (new — 600 s e2e simulation)
- .planning/phases/01-stabilize-video-pipeline/01-CONTEXT.md (D-17 amendment appended; original D-17 untouched)
## Bisect Results (2026-05-16T12:15:00Z) ## Bisect Results (2026-05-16T12:15:00Z)

View File

@@ -299,5 +299,74 @@ phase, so downstream phases see a consistent baseline:
--- ---
## Amendment (Phase 01-stabilize-video-pipeline, 2026-05-16) — D-17 port lifecycle
- **AMENDED-BY:** debug session `empty-archive-port-race` (Option C, resolved 2026-05-16)
- **Replaces nothing.** D-17 above stands as the original port-keepalive
contract. This amendment narrows the port LIFECYCLE shape that was
Claude's-discretion at Phase 1 plan time.
- **Background.** UAT Test 3 surfaced two coupled defects: 3× Uncaught
"Attempting to use a disconnected port object" in the offscreen console
starting at the 290 s pre-emptive reconnect mark, AND a silent
empty-video archive shipping to the operator. Bisect proved the H1
port race was introduced by Plan 01-04 (commit `b064a21`) and the H2
silent-skip in `createArchive` was an upstream defect from commit
`555eb05` that the port race amplified from latent to fatal. Full
lineage and strategy rationale: `.planning/debug/resolved/empty-archive-port-race.md`.
- **Architectural commitments retired:**
- The legacy 290 000 ms pre-emptive `setTimeout` reconnect (Pitfall 4)
is RETIRED. Its race window between the synchronous `.disconnect()`
and the `onDisconnect` handler firing was the proximate cause of
H1 — see the bisect notes.
- **Architectural commitments added:**
- **Port health probe (PING ↔ PONG).** The offscreen `PORT_PING_MS`
interval doubles as a liveness probe; each PING expects a PONG echo
from the SW. The offscreen tracks `missedPongs` and reconnects
cleanly when the count exceeds `MAX_MISSED_PONGS = 3` (~75 s of
unresponsive SW — well past Chrome's ~30 s idle threshold, so a
real disconnect would already have surfaced its own
`onDisconnect`). The SW echoes PONG on every PING via the
onConnect-level message sink.
- **Request-id'd REQUEST_BUFFER / BUFFER.** Every `REQUEST_BUFFER`
carries a SW-generated `requestId` (crypto.randomUUID with
Math.random fallback). The offscreen echoes the same `requestId`
on the BUFFER response. The SW routes BUFFER → pending request via
a module-level `Map<requestId, PendingBufferRequest>` — port-
agnostic, so port replacement does not lose the response.
- **Retry on port replacement.** Every `onConnect` (post-bootstrap)
scans `pendingBufferRequests` and re-issues REQUEST_BUFFER on the
fresh port with the SAME requestId. The offscreen posts BUFFER on
the CURRENT `keepalivePort`, the sink matches by id, and the
request resolves. This retires the H2 silent-drop class
architecturally.
- **Outer hard-timeout bumped 2 s → 10 s.** The legacy per-port
`BUFFER_FETCH_TIMEOUT_MS = 2000` was too tight to cover a retry
across a reconnect. The new outer budget covers EVERY retry across
port replacements; the inner per-port round-trip is still
~100-200 ms.
- **Operator-visible failure surface.** `createArchive` throws
`EmptyVideoBufferError` when the video buffer is empty. `saveArchive`
catches and emits `{type:'RECORDING_ERROR', error:'empty-video-buffer'}`
via `chrome.runtime.sendMessage` for the popup, AND returns
`{success:false, error}` in the direct-response path. Replaces the
upstream silent-skip branch in `createArchive` that shipped a
video-less archive in silence.
- **Pinning contracts added:**
- `tests/offscreen/port-health-probe.test.ts` — pins the PING/PONG +
request-id'd encode contract on the offscreen side (4 tests).
- `tests/background/request-id-protocol.test.ts` — pins the SW-side
request-id routing + retry + error-surface contract (5 tests).
- `tests/background/port-lifecycle-continuous.test.ts` — continuous
600 s end-to-end simulation: 12 ping/pong cycles + 2 forced
reconnects + 3 SAVE_ARCHIVE round-trips, asserts no Uncaught and
every BUFFER carries segments.
- `tests/offscreen/port-reconnect-race.test.ts` (refactored): H1.b
no longer pins the retired 290 s setTimeout path — it now pins
the externally-disconnected-port → ping try/catch → reconnect path
that the H1 fix delivers.
---
*Phase: 01-stabilize-video-pipeline* *Phase: 01-stabilize-video-pipeline*
*Context gathered: 2026-05-15* *Context gathered: 2026-05-15*
*Amended: 2026-05-16 (debug session empty-archive-port-race, Option C)*