Files
mokosh/.planning/debug/resolved/empty-archive-port-race.md
Mark f0871c0237 docs(option-c): archive empty-archive-port-race + amend CONTEXT.md D-17 port lifecycle
Two doc updates closing the debug session per the resolved pattern this
phase has established (cf. resolved/d12-blob-port-transfer-fails.md and
resolved/webm-playback-freeze.md):

1. **Move debug session to resolved/** with the Resolution section
   filled in (root_cause, fix, verification, files_changed). Status
   flipped tdd_red_confirmed -> resolved. Original investigation
   notes + bisect results + Option C strategy spec all preserved
   in-place — the file is the full provenance trail.

2. **Amend 01-CONTEXT.md D-17** with the new port lifecycle commitments.
   Append-only (D-17 itself untouched) per the doc cascade rule
   established earlier this phase ("amendments append, do not replace,
   to preserve SPEC provenance"). The amendment narrates:
   - What was Claude's-discretion at Phase 1 plan time has been
     specified by Option C.
   - The 290 s pre-emptive setTimeout reconnect (Pitfall 4) is RETIRED.
   - The architectural commitments added: PING/PONG health probe,
     request-id'd REQUEST_BUFFER/BUFFER, SW retry on port replacement,
     outer 10 s hard-timeout, operator-visible EmptyVideoBufferError
     surface.
   - The 4 pinning contracts added (port-health-probe,
     request-id-protocol, port-lifecycle-continuous, plus the
     refactored port-reconnect-race).

Suite remains 11 files / 53 tests, all GREEN. Quality gates intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 15:40:02 +02:00

503 lines
26 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
slug: empty-archive-port-race
status: resolved
trigger: |
Phase 1 UAT Test 3 surfaced a two-headed BLOCKER:
(a) Archive shipped via popup save contains NO video/last_30sec.webm
(only rrweb/, logs/, screenshot.png, meta.json) — silent data loss
defeating the entire phase goal.
(b) 3× Uncaught Error: "Attempting to use a disconnected port object"
surface in the offscreen console starting at the 290 s pre-emptive
port-reconnect mark (recorder.ts line 627), spaced ~25-30 s
apart (PORT_PING_MS = 25_000 cadence).
Both symptoms cluster around src/offscreen/recorder.ts:597-630
(port reconnect / encodeAndSendBuffer interaction) and
src/background/index.ts:130-167, 340-393 (BUFFER round-trip + zip
build branch that silently elides the video file on empty segments).
created: 2026-05-16T11:58:51Z
updated: 2026-05-16T13:00:00Z
phase: 01-stabilize-video-pipeline
related_uat: .planning/phases/01-stabilize-video-pipeline/01-UAT.md
related_review_fix: .planning/phases/01-stabilize-video-pipeline/01-REVIEW-FIX.md
prior_resolved_sessions:
- .planning/debug/resolved/d12-blob-port-transfer-fails.md
- .planning/debug/resolved/webm-playback-freeze.md
---
# Debug: Empty-archive + port-reconnect race (Phase 1 BLOCKER)
## Symptoms (from UAT Test 3)
**Expected behavior:**
When the operator clicks save in the popup, the produced ZIP archive
contains `video/last_30sec.webm` — a playable WebM with the most recent
30 s of captured screen video derived from D-13 restart-segments ring
buffer (3 × 10 s self-contained segments).
**Actual behavior:**
1. `smoke.sh` waited 600 s before detecting a new archive in ~/Downloads.
2. The detected archive
(`/home/parf/Downloads/session_report_2026-05-16_13-54-52.zip`,
88 254 bytes) contains:
```
rrweb/ (dir, empty)
rrweb/session.json (2 bytes — empty JSON)
logs/ (dir, empty)
logs/events.json (2 bytes — empty JSON)
screenshot.png (87 300 bytes)
meta.json (336 bytes)
```
**No `video/` directory, no `last_30sec.webm`.** Total archive
88 KB vs. expected ~1.5 MB with video.
3. Offscreen console showed 3× `Uncaught Error: Attempting to use a
disconnected port object at offscreen-Bp_IKlxL.js:1:4644` starting
at exactly 290 s after recording start (timestamps 11:50:10, ~11:50:40,
~11:51:00) — coincides with `pre-emptive port reconnect (290 s cap)`
log line at recorder.ts:627.
4. Ring buffer itself was healthy throughout — 33+ segments rotated
cleanly with `kept: 3/3` invariant (sizes 472-799 KB per 10 s segment).
The D-13 restart-segments lifecycle is NOT the broken component.
**Error messages:**
```
offscreen-Bp_IKlxL.js:1 Uncaught Error: Attempting to use a disconnected port object
at offscreen-Bp_IKlxL.js:1:4644
```
And smoke.sh terminal:
```
caution: filename not matched: video/last_30sec.webm
```
**Timeline:**
- Bug introduced: not yet bisected. Most plausible window is one of:
- Plan 01-04 (port keepalive + OFFSCREEN_READY handshake) — added
the long-lived port that the pre-emptive reconnect operates on.
- Plan 01-05 (SW shrink + port host with T-1-04 sender check) —
added `getVideoBufferFromOffscreen` request-response pattern.
- Review-fix iteration 1 (commit 2e3f524 = CR-01 + CR-02 + CR-03 +
WR-03 + WR-09) — CR-01 specifically introduced the
"portAtRequest captured before await + refuse-to-post on stale
port" pattern in encodeAndSendBuffer that drops responses on the
floor without retry.
- Review-fix iteration 2 sweep #5 (commit 034155b "surface
port-replaced-during-fetch diagnostic on buffer timeout") — added
diagnostic logging on timeout but no retry.
- Never worked end-to-end with operator-side save under realistic
recording duration. The original Phase 1 closure on 2026-05-15 was
confirmed with a fresh fixture from a successful smoke run, but
the fixture was regenerated by a smoke session that almost certainly
completed before the 290 s pre-emptive reconnect window
(smoke required ≥ 35 s recording, not 5+ minutes).
**Reproduction:**
1. `npm run build` → produces dist/
2. `KEEP_PROFILE=0 ./smoke.sh`
3. In Chrome: load dist/ unpacked, click extension icon → picker
auto-accepts smoke tab → recording starts (logs confirm).
4. WAIT past the 290 s mark (e.g., 5+ minutes). Watch the offscreen
console: at 290 s see `pre-emptive port reconnect`, immediately
followed by `Uncaught Error: Attempting to use a disconnected port
object`, then 2-3 more such errors at PORT_PING_MS=25_000 intervals.
5. Click extension icon → "Сохранить отчёт об ошибке". Observe the
long wait time (the user saw 600 s — investigate whether multiple
clicks were needed or whether saveArchive has its own retry).
6. Inspect the produced zip: `unzip -l ~/Downloads/session_report_*.zip`
→ no `video/last_30sec.webm` entry.
## Current Focus
hypothesis: |
Two coupled defects sharing the same root cause (port-reconnect race
window in src/offscreen/recorder.ts):
H1 (the noise — Uncaught Errors): The ping `setInterval` at
recorder.ts:623-625 uses `keepalivePort?.postMessage({type:'PING'})`.
Optional chaining only guards against `null`, not against a
connected-but-being-disconnected `Port` object. When the
pre-emptive `setTimeout` fires at 290 s (recorder.ts:626-630) and
calls `keepalivePort?.disconnect()`, there's a microtask race
window before the onDisconnect handler (recorder.ts:616-622) runs
`teardownPortTimers()`. Any ping callback already queued/dispatched
in that window throws "Attempting to use a disconnected port object."
Two follow-on errors at ~25 s and ~50 s post-reconnect suggest the
recursive `connectPort()` re-installs the ping interval against
a NEW port that itself gets disconnected (possibly by a second
pre-emptive cycle, OR by the SW idle-unloading), but the error
remains because the same race recurs.
H2 (the silent data loss — empty-video archive): The same
reconnect-race window leaves REQUEST_BUFFER / BUFFER round-trips
vulnerable. When the operator clicks save during or shortly after
a reconnect:
(a) SW's `getVideoBufferFromOffscreen`
(src/background/index.ts:130-167) sends REQUEST_BUFFER via the
OLD port that just disconnected, OR
(b) Offscreen's `encodeAndSendBuffer`
(src/offscreen/recorder.ts:522-545, 547-604) captures
`portAtRequest` before its `await blobToBase64(...)` for ~3
segments (~150 ms), the pre-emptive reconnect fires mid-encode,
then the guard at line 597-601 correctly REFUSES to post on
the now-stale `portAtRequest` and just returns.
Either way, the SW's per-request listener times out silently after
`BUFFER_FETCH_TIMEOUT_MS = 2_000`, sets `videoBufferResponse =
{ segments: [] }`, and `createArchive`
(src/background/index.ts:340-352) takes the empty branch:
if (videoBufferResponse.segments.length > 0) { ... } else {
logger.warn('✗ No video segments to add'); // only logs
}
→ archive built WITHOUT the video file, shipped to operator silently.
The 600 s smoke wait is consistent with the operator clicking save
multiple times until one REQUEST_BUFFER round-trip happened to
complete on a fresh port — and even then, that single successful
trip may itself have caught only one of the 3 segments before the
next reconnect (worth checking against actual archive's NULL state
vs partial state).
test: |
Initial evidence-gathering test (RED): write a vitest spec that
drives recorder.ts's port lifecycle through a pre-emptive reconnect
while a ping is queued, and asserts NO Uncaught Error escapes.
This pins H1. For H2 a second test drives REQUEST_BUFFER →
encodeAndSendBuffer → reconnect-mid-await → asserts BUFFER is
retried on the fresh port (or that saveArchive surfaces an error
instead of silently dropping).
expecting: |
Initial RED tests fail in current code (proves bugs are
reproducible at the unit level). Then the gsd-debugger isolates
the precise call sequence + lands the GREEN.
next_action: TDD checkpoint — RED tests landed and confirmed failing for the exact UAT error string. Awaiting user confirmation of RED gate before writing GREEN fix.
reasoning_checkpoint: ""
tdd_checkpoint: |
Test file: tests/offscreen/port-reconnect-race.test.ts (3 tests, all RED)
Status: RED ✓ (failures confirm both H1 + H2 hypotheses)
Baseline: 7 files / 40 tests still GREEN (no regression in pinning contracts)
tsc --noEmit: exit 0
Failure outputs:
H1 AssertionError: expected [Function] to not throw an error
but 'Error: Attempting to use a disconnected port object' was thrown
(line 200 — production ping callback throws on disconnected port)
H1.b AssertionError: expected [Function] to not throw an error
but 'Error: Attempting to use a disconnected port object' was thrown
(line 419 — pre-emptive reconnect path triggers the same race)
H2 AssertionError: expected 0 to be greater than or equal to 1
(line 341 — production code silently drops BUFFER when reconnect
arrives mid-encode; neither old nor new port receives any BUFFER)
## Constraints
- TDD mode is ON (workflow.tdd_mode: true). RED test must land before
GREEN fix.
- Auto-loaded memories: `feedback-gsd-ceremony-for-fixes.md` (no
hot-edits; route through /gsd-debug) and
`feedback-no-unilateral-scope-reduction.md` (no scope narrowing;
surface choices via AskUserQuestion).
- Both fixes must NOT regress D-12 (base64 wire format) or D-13
(restart-segments). Existing tests at
`tests/offscreen/port-serialization.test.ts` and
`tests/offscreen/segment-rotation.test.ts` are the pinning contracts.
- Existing tests (40 across 7 files) all green; tsc exit 0; build
exit 0. Any fix must preserve that baseline.
## Files of Interest (preliminary; debugger may expand)
- src/offscreen/recorder.ts:
- 522-545: encodeAndSendBuffer (CR-01 portAtRequest guard)
- 547-604: doEncodeAndSendBuffer (base64 encode loop + WR-03
monotonic seq + stale-port refuse + postMessage)
- 606-630: connectPort (port lifecycle, ping interval, pre-emptive
reconnect setTimeout)
- 616-622: onDisconnect handler (teardownPortTimers + recursive
connectPort)
- 39: PORT_RECONNECT_MS = 290_000
- 38: PORT_PING_MS = 25_000
- src/background/index.ts:
- 130-167: getVideoBufferFromOffscreen (per-request listener
+ BUFFER_FETCH_TIMEOUT_MS = 2_000)
- 340-393: createArchive (the silent-empty-video branch at 346-352)
- 430-493: saveArchive (orchestrator)
- tests/offscreen/port.test.ts (existing — only covers SW-side
disconnect path, NOT pre-emptive reconnect race)
- tests/offscreen/port-serialization.test.ts (D-12 contract)
- tests/offscreen/segment-rotation.test.ts (D-13 contract)
- tests/offscreen/port-reconnect-race.test.ts (NEW — RED gate for this
debug session; pins H1 + H1.b + H2 contracts at unit-test level)
## Evidence
- timestamp: 2026-05-16T12:06:29Z — Baseline established. `npx vitest run`
→ 7 files passed / 40 tests passed. Confirms a clean RED→GREEN baseline
for the new test file.
- timestamp: 2026-05-16T12:06:45Z — Static analysis of src/offscreen/recorder.ts
+ the minified production bundle dist/assets/offscreen-Bp_IKlxL.js
CONFIRMS H1 at the bytecode level:
`grep -oP '.{60}postMessage.{20}' offscreen-Bp_IKlxL.js` yields:
`connecting"),E(),o=null,O()}),p=setInterval(()=>{o==null||o.postMessage({type:"PING"})},k)`
This is exactly the pattern at recorder.ts:623-625 — the ping
callback uses `?.postMessage` whose only safety check is null. The
minified column 4644 in the UAT error stack maps to this site:
`o.postMessage({type:"PING"})` where `o` is a non-null but
disconnected Port. Chrome's documented runtime semantics for Port
say postMessage on a disconnected port throws synchronously with
the exact error string observed in UAT
("Attempting to use a disconnected port object").
- timestamp: 2026-05-16T12:07:00Z — Static analysis of the
doEncodeAndSendBuffer flow CONFIRMS H2 silent-drop:
`src/offscreen/recorder.ts:597-601` —
`if (keepalivePort !== portAtRequest) { logger.warn(...); return; }`
On a mid-encode reconnect, keepalivePort is reassigned by the
onDisconnect→connectPort chain while portAtRequest still points to
the OLD port. The function returns WITHOUT posting anywhere.
`src/background/index.ts:134-150` — SW per-request listener bound
to the captured `port` (the OLD port). After 2 s
(BUFFER_FETCH_TIMEOUT_MS) it resolves `{ segments: [] }`.
`src/background/index.ts:346-352` — `createArchive` branch:
`if (videoBufferResponse.segments.length > 0) { ... } else {
logger.warn('✗ No video segments to add'); }`
No throw, no surface to operator. Archive ships WITHOUT video.
- timestamp: 2026-05-16T12:09:30Z — RED tests written and executed:
`tests/offscreen/port-reconnect-race.test.ts`. 3 tests, all RED:
H1 `ping callback handles a remotely-disconnected port without
throwing` — FAILS with "Attempting to use a disconnected port
object" (matches UAT error string byte-for-byte).
H1.b `pre-emptive reconnect path — ping does not throw against
just-disconnected port` — FAILS with same error (proves the
290 s pre-emptive path is the same race).
H2 `REQUEST_BUFFER mid-reconnect must NOT drop BUFFER when
segments exist` — FAILS with `expected 0 to be greater than
or equal to 1`. 3 seeded segments + mid-encode reconnect →
neither old nor new port receives any BUFFER. Proves the
silent-drop is real and reproducible at the unit level.
- timestamp: 2026-05-16T12:09:40Z — Full-suite check: 8 files (was 7) /
43 tests (was 40) — exactly 3 new RED, 40 baseline still GREEN. No
regression. `npx tsc --noEmit` exit 0 (new test file is type-clean).
## Eliminated
- **Not a buffer-emptiness issue.** UAT logs show 33+ segments rotated
successfully with `kept: 3/3` invariant maintained throughout the
600 s session. The ring buffer always had 3 × ~500-800 KB segments
available when SAVE_ARCHIVE fired. The data is THERE; the transport
fails to deliver it.
- **Not a D-13 segment-lifecycle regression.** existing
segment-rotation.test.ts contract still passes; the WebM payload
format is sound (preserved by the d12 base64 wire-format fix).
- **Not a D-12 wire-format regression.** existing
port-serialization.test.ts contract still passes; the base64 round-
trip is intact. The bug is at the port-lifecycle layer ABOVE the
wire format.
- **Not a smoke.sh artifact.** the actual zip on disk was inspected
by-hand (`unzip -l ~/Downloads/session_report_2026-05-16_13-54-52.zip`)
— the absence of video/last_30sec.webm is on the wire, not in the
smoke harness's reporting.
- **Not a JSZip issue.** other archive entries (screenshot.png, meta.json)
are present and well-formed. JSZip received an archive build with no
video entry — because createArchive's silent-skip branch never
called `zip.file('video/last_30sec.webm', videoBlob)`.
- **Not a chrome.runtime.sendMessage problem.** the bug is on the
long-lived `chrome.runtime.connect` port (offscreen↔SW), not on the
popup↔SW sendMessage path. The popup→SW SAVE_ARCHIVE call works
fine — it just gets `{success: true}` back even though the archive
is video-less.
## Resolution
root_cause: |
Two coupled defects sharing a single architectural failure mode.
H1 (Uncaught Errors): The PING setInterval at recorder.ts:623-625
used `keepalivePort?.postMessage({type:'PING'})`. Optional chaining
only guards against `null`, NOT against a connected-but-being-
disconnected Port object. The 290 s pre-emptive setTimeout
(recorder.ts:626-630) called `keepalivePort?.disconnect()`, creating
a microtask race window before the onDisconnect handler ran the
cleanup. Any ping callback queued in that window threw "Attempting
to use a disconnected port object" synchronously.
H2 (silent empty-video archive): The reconnect-race window left
REQUEST_BUFFER round-trips vulnerable. The offscreen captured
`portAtRequest = keepalivePort` BEFORE await blobToBase64() (CR-01
pattern); on mid-encode reconnect the post-await guard correctly
REFUSED to post on the now-stale portAtRequest. The SW's per-
request listener was bound to that same stale port, so it timed
out (BUFFER_FETCH_TIMEOUT_MS = 2 s) and resolved {segments: []}.
createArchive's branch at index.ts:346-352 silently elided the
video file when segments.length === 0 — operator received an
88 KB zip with no last_30sec.webm. Bisect confirmed this silent-
skip branch was an UPSTREAM defect (commit 555eb05) that the
Plan 01-04 port lifecycle (b064a21) + CR-01 (2e3f524) amplified
from latent to fatal.
fix: |
Option C (Architectural) — request-id'd port protocol + port-health
probe + retry on port replacement + operator-visible error surface.
1. RETIRED the 290 s pre-emptive setTimeout reconnect. Replaced with
PONG-based health probe: each PING expects a PONG echo from the
SW; the offscreen counts missedPongs and reconnects via clean
teardown when MAX_MISSED_PONGS=3 exceeded.
2. H1 fix: wrapped PING postMessage in try/catch. On synchronous
throw, route through reconnectPort() — no more Uncaught Error
bubbles out to the offscreen console.
3. Request-id'd REQUEST_BUFFER / BUFFER: SW generates a uuid per
request (crypto.randomUUID), sends with {type:'REQUEST_BUFFER',
requestId}. Offscreen echoes the same requestId on the BUFFER
response. SW routes via module-level pendingBufferRequests Map
keyed by requestId — port-agnostic, so port replacement does
NOT lose the response.
4. SW retry on port replacement: every onConnect post-bootstrap
scans pendingBufferRequests and re-issues REQUEST_BUFFER on the
fresh port with the SAME requestId. The offscreen posts BUFFER
on the CURRENT keepalivePort (no more stale-port refuse-to-post).
This retires the H2 silent-drop class architecturally.
5. createArchive throws EmptyVideoBufferError on empty video
buffer. saveArchive catches and emits RECORDING_ERROR to the
popup AND returns {success:false, error} — operator sees a
visible failure instead of silently shipping a video-less zip.
6. PING → PONG echo: SW replies to every PING with PONG, closing
the offscreen's health-probe loop.
Outer BUFFER_FETCH_TIMEOUT_MS bumped 2 s → 10 s to cover retries
across multiple port replacements (inner round-trip still
~100-200 ms).
verification: |
All 12 RED-gate tests across 3 files flipped GREEN:
- tests/offscreen/port-reconnect-race.test.ts (H1 + H1.b + H2)
- tests/offscreen/port-health-probe.test.ts (A, B, C, D — health
probe + request-id echo)
- tests/background/request-id-protocol.test.ts (1, 2, 3, 4, 5 —
request-id protocol + retry + error surface + PONG echo)
New continuous-end-to-end pinning contract in
tests/background/port-lifecycle-continuous.test.ts covers 600 s
of port lifecycle (12 ping cycles + 2 forced reconnects + 3
SAVE_ARCHIVE round-trips). Asserts no Uncaught + every BUFFER
carries segments + PONGs round-trip.
Pinning contracts preserved with no regression:
- tests/offscreen/port-serialization.test.ts (D-12): GREEN
- tests/offscreen/segment-rotation.test.ts (D-13): GREEN
- tests/offscreen/webm-playback.test.ts (A3 ffmpeg dry-run): GREEN
Final suite: 11 files / 53 tests, all GREEN.
Quality gates: npx tsc --noEmit exit 0; grep "as any|@ts-ignore"
clean across src/offscreen + src/background/index.ts + src/shared;
npm run build exit 0 (5.71 kB offscreen, 73.79 kB main).
CONTEXT.md D-17 amended (append, not replace) with the new port
lifecycle commitments. Original D-17 left intact.
files_changed:
- src/shared/types.ts (PortMessage.requestId + 'PONG')
- src/offscreen/recorder.ts (health probe + request-id'd encode + H1 try/catch + reconnectPort helper)
- src/background/index.ts (request-id protocol + retry + PONG echo + EmptyVideoBufferError surface + decodeBufferSegments extraction)
- tests/offscreen/port-reconnect-race.test.ts (H1.b refactored to externally-disconnected path; H2 sends requestId)
- tests/offscreen/port-health-probe.test.ts (new — 4 contract tests)
- tests/background/request-id-protocol.test.ts (new — 5 contract tests)
- tests/background/port-lifecycle-continuous.test.ts (new — 600 s e2e simulation)
- .planning/phases/01-stabilize-video-pipeline/01-CONTEXT.md (D-17 amendment appended; original D-17 untouched)
## Bisect Results (2026-05-16T12:15:00Z)
User chose "Bisect first" from the fix-strategy fork. Targeted git-log
scan across the suspect code regions across the full Phase 1 history
(`555eb05..HEAD`):
| Symptom | Mechanism | Introduction commit | Subsequent amplification |
|---------|----------------------------------------|------------------------------------------------------|-----------------------------------------------------------------|
| H1.a | ping setInterval with `?.postMessage` | `b064a21` (Plan 01-04, 2026-05-15) | none — race exists since this commit |
| H1.b | pre-emptive 290 s reconnect setTimeout | `b064a21` (Plan 01-04, 2026-05-15) | none |
| H2 | createArchive silent-skip on empty | **`555eb05` (imported broken upstream — predates Phase 1)** | `2e3f524` CR-01 "refuse-to-post on swapped port" without retry; `034155b` sweep #5 added diagnostic on timeout but still no retry |
**Strategic implications surfaced by bisect:**
1. **H2 silent-skip is an UPSTREAM defect** in the original "half-broken
blob of code" — never on the Phase 1 fix radar because:
- Original 22-defect audit didn't flag it (it's a "what happens if a
downstream returns 0?" question that needs a real failure mode to
surface)
- REVIEW.md's 18 findings didn't catch it
- Sweep #5 (commit 034155b "surface port-replaced-during-fetch
diagnostic on buffer timeout") added a log on timeout but kept the
silent-skip behaviour — the diagnostic logs went to the SW console,
but the operator-facing failure path stayed silent.
2. **H1 port race latency-amplifies H2 from latent to fatal.** Before
Plan 01-04 there was no port → no reconnect → silent-skip never fired
in practice. After 01-04 the 290 s pre-emptive reconnect creates a
guaranteed-recurring window where REQUEST_BUFFER round-trips can
fail; after CR-01 the "refuse to post on swapped port" pattern makes
the silent-skip fire with near-100 % probability whenever a save is
clicked across a reconnect window.
3. **Option A** (target both symptoms with retry + try/catch) is
sufficient to make the visible symptoms go away, but leaves the
upstream silent-skip pattern intact — a class of latent failures
remains. **Option B** (which converts silent-skip into operator-
visible error) directly retires the upstream class. **Option C**
(request-id'd round-trip protocol) makes port-replacement
irrelevant by architecture but is the largest blast radius.
The bisect strengthens the case for B (it's not just symptom-control;
it retires the entire shape of bug, including the upstream one).
However, A is still defensible if treated as a pure hotfix to ship
Phase 1 and the silent-skip fix is consciously deferred to a follow-up
plan. C is justifiable if the user wants Phase 1's port lifecycle to
match the durability of the D-12/D-13 contracts before Phase 2 layers
DOM capture on top.
(Updated by orchestrator after bisect; user decides next.)
## Fix Strategy Chosen: Option C (Architectural)
User decided (post-bisect): **Option C — Architectural port lifecycle refactor.**
Rationale: bisect shows the current port lifecycle's race window is the
proximate cause of every H1 error AND the trigger for the upstream H2
silent-skip class. Option C makes port-replacement architecturally
irrelevant by:
1. Replace pre-emptive 290 s `setTimeout` reconnect with **port-health
probe** (ping/pong protocol; offscreen explicitly waits for SW echo
before considering port live; on any failure → clean teardown +
reconnect).
2. Switch REQUEST_BUFFER / BUFFER to **request-id'd** pattern: SW
sends `{type:'REQUEST_BUFFER', requestId: uuid}`; offscreen
responds with `{type:'BUFFER', requestId, segments}`; SW only
accepts responses matching its current pending requestId. SW
retries on port replacement until success or hard outer timeout.
3. Continuous end-to-end vitest covering 600 s of port lifecycle
(2 reconnects + simulated REQUEST_BUFFER round-trips). Becomes
the new pinning contract for the port lifecycle.
4. **Keep** the existing operator-visible safety from Option B: even
in the new design, if a hard outer timeout fires, `createArchive`
must throw → popup surfaces `RECORDING_ERROR` to operator instead
of silently shipping a video-less archive.
Touches D-17 port-lifecycle contract (CONTEXT.md will be amended).
Expected surface: ~150-200 LOC + 2-3 new test files. Pre-existing
40 GREEN tests + 3 new RED tests must all flip GREEN at completion.
Constraints (re-iterated for the executor):
- TDD: each new behaviour gets a RED test BEFORE GREEN implementation.
- D-12 (base64 wire format) and D-13 (restart-segments) contracts must
not regress. Existing pinning tests at
`tests/offscreen/port-serialization.test.ts` and
`tests/offscreen/segment-rotation.test.ts` are the canaries.
- `npx tsc --noEmit` exit 0, `npm run build` exit 0, type-safety
grep clean must all hold at every commit.
- `tests/fixtures/last_30sec.webm` is currently dirty in the working
tree (operator's recent smoke session staged a new file at that
path). Do NOT commit changes to that file — the committed D-13
fixture is the canonical contract.
(Strategy decided 2026-05-16T12:18:00Z; executor dispatch follows.)