Files
mokosh/.planning/debug/empty-archive-port-race.md
Mark 674c415945 test(debug-empty-archive): RED gate for empty-archive-port-race (H1 + H1.b + H2)
Phase 1 UAT Test 3 surfaced a two-headed BLOCKER:
(a) silent empty-video archive when save crosses a port-reconnect window,
(b) 3x "Attempting to use a disconnected port object" Uncaught Errors
starting at the 290 s pre-emptive reconnect mark.

Bisect confirmed: H1 (port lifecycle race) was introduced by Plan 01-04
(b064a21); H2 (createArchive silent-skip on empty segments) is an upstream
defect (555eb05) that became fatal once CR-01 + sweep #5 guaranteed the
silent-skip branch would fire on every save during a reconnect window.

This commit lands the 3 RED tests at the unit-test level — they match the
UAT error string byte-for-byte for H1/H1.b and pin the silent-drop
contract for H2. They will flip GREEN as the Option C architectural
refactor (request-id'd port protocol + port-health probe + retry +
operator-visible error surface) lands across the next commits.

Baseline: 8 files / 43 tests (40 GREEN, 3 RED). tsc --noEmit exit 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 14:17:45 +02:00

407 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
slug: empty-archive-port-race
status: tdd_red_confirmed
trigger: |
Phase 1 UAT Test 3 surfaced a two-headed BLOCKER:
(a) Archive shipped via popup save contains NO video/last_30sec.webm
(only rrweb/, logs/, screenshot.png, meta.json) — silent data loss
defeating the entire phase goal.
(b) 3× Uncaught Error: "Attempting to use a disconnected port object"
surface in the offscreen console starting at the 290 s pre-emptive
port-reconnect mark (recorder.ts line 627), spaced ~25-30 s
apart (PORT_PING_MS = 25_000 cadence).
Both symptoms cluster around src/offscreen/recorder.ts:597-630
(port reconnect / encodeAndSendBuffer interaction) and
src/background/index.ts:130-167, 340-393 (BUFFER round-trip + zip
build branch that silently elides the video file on empty segments).
created: 2026-05-16T11:58:51Z
updated: 2026-05-16T12:09:40Z
phase: 01-stabilize-video-pipeline
related_uat: .planning/phases/01-stabilize-video-pipeline/01-UAT.md
related_review_fix: .planning/phases/01-stabilize-video-pipeline/01-REVIEW-FIX.md
prior_resolved_sessions:
- .planning/debug/resolved/d12-blob-port-transfer-fails.md
- .planning/debug/resolved/webm-playback-freeze.md
---
# Debug: Empty-archive + port-reconnect race (Phase 1 BLOCKER)
## Symptoms (from UAT Test 3)
**Expected behavior:**
When the operator clicks save in the popup, the produced ZIP archive
contains `video/last_30sec.webm` — a playable WebM with the most recent
30 s of captured screen video derived from D-13 restart-segments ring
buffer (3 × 10 s self-contained segments).
**Actual behavior:**
1. `smoke.sh` waited 600 s before detecting a new archive in ~/Downloads.
2. The detected archive
(`/home/parf/Downloads/session_report_2026-05-16_13-54-52.zip`,
88 254 bytes) contains:
```
rrweb/ (dir, empty)
rrweb/session.json (2 bytes — empty JSON)
logs/ (dir, empty)
logs/events.json (2 bytes — empty JSON)
screenshot.png (87 300 bytes)
meta.json (336 bytes)
```
**No `video/` directory, no `last_30sec.webm`.** Total archive
88 KB vs. expected ~1.5 MB with video.
3. Offscreen console showed 3× `Uncaught Error: Attempting to use a
disconnected port object at offscreen-Bp_IKlxL.js:1:4644` starting
at exactly 290 s after recording start (timestamps 11:50:10, ~11:50:40,
~11:51:00) — coincides with `pre-emptive port reconnect (290 s cap)`
log line at recorder.ts:627.
4. Ring buffer itself was healthy throughout — 33+ segments rotated
cleanly with `kept: 3/3` invariant (sizes 472-799 KB per 10 s segment).
The D-13 restart-segments lifecycle is NOT the broken component.
**Error messages:**
```
offscreen-Bp_IKlxL.js:1 Uncaught Error: Attempting to use a disconnected port object
at offscreen-Bp_IKlxL.js:1:4644
```
And smoke.sh terminal:
```
caution: filename not matched: video/last_30sec.webm
```
**Timeline:**
- Bug introduced: not yet bisected. Most plausible window is one of:
- Plan 01-04 (port keepalive + OFFSCREEN_READY handshake) — added
the long-lived port that the pre-emptive reconnect operates on.
- Plan 01-05 (SW shrink + port host with T-1-04 sender check) —
added `getVideoBufferFromOffscreen` request-response pattern.
- Review-fix iteration 1 (commit 2e3f524 = CR-01 + CR-02 + CR-03 +
WR-03 + WR-09) — CR-01 specifically introduced the
"portAtRequest captured before await + refuse-to-post on stale
port" pattern in encodeAndSendBuffer that drops responses on the
floor without retry.
- Review-fix iteration 2 sweep #5 (commit 034155b "surface
port-replaced-during-fetch diagnostic on buffer timeout") — added
diagnostic logging on timeout but no retry.
- Never worked end-to-end with operator-side save under realistic
recording duration. The original Phase 1 closure on 2026-05-15 was
confirmed with a fresh fixture from a successful smoke run, but
the fixture was regenerated by a smoke session that almost certainly
completed before the 290 s pre-emptive reconnect window
(smoke required ≥ 35 s recording, not 5+ minutes).
**Reproduction:**
1. `npm run build` → produces dist/
2. `KEEP_PROFILE=0 ./smoke.sh`
3. In Chrome: load dist/ unpacked, click extension icon → picker
auto-accepts smoke tab → recording starts (logs confirm).
4. WAIT past the 290 s mark (e.g., 5+ minutes). Watch the offscreen
console: at 290 s see `pre-emptive port reconnect`, immediately
followed by `Uncaught Error: Attempting to use a disconnected port
object`, then 2-3 more such errors at PORT_PING_MS=25_000 intervals.
5. Click extension icon → "Сохранить отчёт об ошибке". Observe the
long wait time (the user saw 600 s — investigate whether multiple
clicks were needed or whether saveArchive has its own retry).
6. Inspect the produced zip: `unzip -l ~/Downloads/session_report_*.zip`
→ no `video/last_30sec.webm` entry.
## Current Focus
hypothesis: |
Two coupled defects sharing the same root cause (port-reconnect race
window in src/offscreen/recorder.ts):
H1 (the noise — Uncaught Errors): The ping `setInterval` at
recorder.ts:623-625 uses `keepalivePort?.postMessage({type:'PING'})`.
Optional chaining only guards against `null`, not against a
connected-but-being-disconnected `Port` object. When the
pre-emptive `setTimeout` fires at 290 s (recorder.ts:626-630) and
calls `keepalivePort?.disconnect()`, there's a microtask race
window before the onDisconnect handler (recorder.ts:616-622) runs
`teardownPortTimers()`. Any ping callback already queued/dispatched
in that window throws "Attempting to use a disconnected port object."
Two follow-on errors at ~25 s and ~50 s post-reconnect suggest the
recursive `connectPort()` re-installs the ping interval against
a NEW port that itself gets disconnected (possibly by a second
pre-emptive cycle, OR by the SW idle-unloading), but the error
remains because the same race recurs.
H2 (the silent data loss — empty-video archive): The same
reconnect-race window leaves REQUEST_BUFFER / BUFFER round-trips
vulnerable. When the operator clicks save during or shortly after
a reconnect:
(a) SW's `getVideoBufferFromOffscreen`
(src/background/index.ts:130-167) sends REQUEST_BUFFER via the
OLD port that just disconnected, OR
(b) Offscreen's `encodeAndSendBuffer`
(src/offscreen/recorder.ts:522-545, 547-604) captures
`portAtRequest` before its `await blobToBase64(...)` for ~3
segments (~150 ms), the pre-emptive reconnect fires mid-encode,
then the guard at line 597-601 correctly REFUSES to post on
the now-stale `portAtRequest` and just returns.
Either way, the SW's per-request listener times out silently after
`BUFFER_FETCH_TIMEOUT_MS = 2_000`, sets `videoBufferResponse =
{ segments: [] }`, and `createArchive`
(src/background/index.ts:340-352) takes the empty branch:
if (videoBufferResponse.segments.length > 0) { ... } else {
logger.warn('✗ No video segments to add'); // only logs
}
→ archive built WITHOUT the video file, shipped to operator silently.
The 600 s smoke wait is consistent with the operator clicking save
multiple times until one REQUEST_BUFFER round-trip happened to
complete on a fresh port — and even then, that single successful
trip may itself have caught only one of the 3 segments before the
next reconnect (worth checking against actual archive's NULL state
vs partial state).
test: |
Initial evidence-gathering test (RED): write a vitest spec that
drives recorder.ts's port lifecycle through a pre-emptive reconnect
while a ping is queued, and asserts NO Uncaught Error escapes.
This pins H1. For H2 a second test drives REQUEST_BUFFER →
encodeAndSendBuffer → reconnect-mid-await → asserts BUFFER is
retried on the fresh port (or that saveArchive surfaces an error
instead of silently dropping).
expecting: |
Initial RED tests fail in current code (proves bugs are
reproducible at the unit level). Then the gsd-debugger isolates
the precise call sequence + lands the GREEN.
next_action: TDD checkpoint — RED tests landed and confirmed failing for the exact UAT error string. Awaiting user confirmation of RED gate before writing GREEN fix.
reasoning_checkpoint: ""
tdd_checkpoint: |
Test file: tests/offscreen/port-reconnect-race.test.ts (3 tests, all RED)
Status: RED ✓ (failures confirm both H1 + H2 hypotheses)
Baseline: 7 files / 40 tests still GREEN (no regression in pinning contracts)
tsc --noEmit: exit 0
Failure outputs:
H1 AssertionError: expected [Function] to not throw an error
but 'Error: Attempting to use a disconnected port object' was thrown
(line 200 — production ping callback throws on disconnected port)
H1.b AssertionError: expected [Function] to not throw an error
but 'Error: Attempting to use a disconnected port object' was thrown
(line 419 — pre-emptive reconnect path triggers the same race)
H2 AssertionError: expected 0 to be greater than or equal to 1
(line 341 — production code silently drops BUFFER when reconnect
arrives mid-encode; neither old nor new port receives any BUFFER)
## Constraints
- TDD mode is ON (workflow.tdd_mode: true). RED test must land before
GREEN fix.
- Auto-loaded memories: `feedback-gsd-ceremony-for-fixes.md` (no
hot-edits; route through /gsd-debug) and
`feedback-no-unilateral-scope-reduction.md` (no scope narrowing;
surface choices via AskUserQuestion).
- Both fixes must NOT regress D-12 (base64 wire format) or D-13
(restart-segments). Existing tests at
`tests/offscreen/port-serialization.test.ts` and
`tests/offscreen/segment-rotation.test.ts` are the pinning contracts.
- Existing tests (40 across 7 files) all green; tsc exit 0; build
exit 0. Any fix must preserve that baseline.
## Files of Interest (preliminary; debugger may expand)
- src/offscreen/recorder.ts:
- 522-545: encodeAndSendBuffer (CR-01 portAtRequest guard)
- 547-604: doEncodeAndSendBuffer (base64 encode loop + WR-03
monotonic seq + stale-port refuse + postMessage)
- 606-630: connectPort (port lifecycle, ping interval, pre-emptive
reconnect setTimeout)
- 616-622: onDisconnect handler (teardownPortTimers + recursive
connectPort)
- 39: PORT_RECONNECT_MS = 290_000
- 38: PORT_PING_MS = 25_000
- src/background/index.ts:
- 130-167: getVideoBufferFromOffscreen (per-request listener
+ BUFFER_FETCH_TIMEOUT_MS = 2_000)
- 340-393: createArchive (the silent-empty-video branch at 346-352)
- 430-493: saveArchive (orchestrator)
- tests/offscreen/port.test.ts (existing — only covers SW-side
disconnect path, NOT pre-emptive reconnect race)
- tests/offscreen/port-serialization.test.ts (D-12 contract)
- tests/offscreen/segment-rotation.test.ts (D-13 contract)
- tests/offscreen/port-reconnect-race.test.ts (NEW — RED gate for this
debug session; pins H1 + H1.b + H2 contracts at unit-test level)
## Evidence
- timestamp: 2026-05-16T12:06:29Z — Baseline established. `npx vitest run`
→ 7 files passed / 40 tests passed. Confirms a clean RED→GREEN baseline
for the new test file.
- timestamp: 2026-05-16T12:06:45Z — Static analysis of src/offscreen/recorder.ts
+ the minified production bundle dist/assets/offscreen-Bp_IKlxL.js
CONFIRMS H1 at the bytecode level:
`grep -oP '.{60}postMessage.{20}' offscreen-Bp_IKlxL.js` yields:
`connecting"),E(),o=null,O()}),p=setInterval(()=>{o==null||o.postMessage({type:"PING"})},k)`
This is exactly the pattern at recorder.ts:623-625 — the ping
callback uses `?.postMessage` whose only safety check is null. The
minified column 4644 in the UAT error stack maps to this site:
`o.postMessage({type:"PING"})` where `o` is a non-null but
disconnected Port. Chrome's documented runtime semantics for Port
say postMessage on a disconnected port throws synchronously with
the exact error string observed in UAT
("Attempting to use a disconnected port object").
- timestamp: 2026-05-16T12:07:00Z — Static analysis of the
doEncodeAndSendBuffer flow CONFIRMS H2 silent-drop:
`src/offscreen/recorder.ts:597-601` —
`if (keepalivePort !== portAtRequest) { logger.warn(...); return; }`
On a mid-encode reconnect, keepalivePort is reassigned by the
onDisconnect→connectPort chain while portAtRequest still points to
the OLD port. The function returns WITHOUT posting anywhere.
`src/background/index.ts:134-150` — SW per-request listener bound
to the captured `port` (the OLD port). After 2 s
(BUFFER_FETCH_TIMEOUT_MS) it resolves `{ segments: [] }`.
`src/background/index.ts:346-352` — `createArchive` branch:
`if (videoBufferResponse.segments.length > 0) { ... } else {
logger.warn('✗ No video segments to add'); }`
No throw, no surface to operator. Archive ships WITHOUT video.
- timestamp: 2026-05-16T12:09:30Z — RED tests written and executed:
`tests/offscreen/port-reconnect-race.test.ts`. 3 tests, all RED:
H1 `ping callback handles a remotely-disconnected port without
throwing` — FAILS with "Attempting to use a disconnected port
object" (matches UAT error string byte-for-byte).
H1.b `pre-emptive reconnect path — ping does not throw against
just-disconnected port` — FAILS with same error (proves the
290 s pre-emptive path is the same race).
H2 `REQUEST_BUFFER mid-reconnect must NOT drop BUFFER when
segments exist` — FAILS with `expected 0 to be greater than
or equal to 1`. 3 seeded segments + mid-encode reconnect →
neither old nor new port receives any BUFFER. Proves the
silent-drop is real and reproducible at the unit level.
- timestamp: 2026-05-16T12:09:40Z — Full-suite check: 8 files (was 7) /
43 tests (was 40) — exactly 3 new RED, 40 baseline still GREEN. No
regression. `npx tsc --noEmit` exit 0 (new test file is type-clean).
## Eliminated
- **Not a buffer-emptiness issue.** UAT logs show 33+ segments rotated
successfully with `kept: 3/3` invariant maintained throughout the
600 s session. The ring buffer always had 3 × ~500-800 KB segments
available when SAVE_ARCHIVE fired. The data is THERE; the transport
fails to deliver it.
- **Not a D-13 segment-lifecycle regression.** existing
segment-rotation.test.ts contract still passes; the WebM payload
format is sound (preserved by the d12 base64 wire-format fix).
- **Not a D-12 wire-format regression.** existing
port-serialization.test.ts contract still passes; the base64 round-
trip is intact. The bug is at the port-lifecycle layer ABOVE the
wire format.
- **Not a smoke.sh artifact.** the actual zip on disk was inspected
by-hand (`unzip -l ~/Downloads/session_report_2026-05-16_13-54-52.zip`)
— the absence of video/last_30sec.webm is on the wire, not in the
smoke harness's reporting.
- **Not a JSZip issue.** other archive entries (screenshot.png, meta.json)
are present and well-formed. JSZip received an archive build with no
video entry — because createArchive's silent-skip branch never
called `zip.file('video/last_30sec.webm', videoBlob)`.
- **Not a chrome.runtime.sendMessage problem.** the bug is on the
long-lived `chrome.runtime.connect` port (offscreen↔SW), not on the
popup↔SW sendMessage path. The popup→SW SAVE_ARCHIVE call works
fine — it just gets `{success: true}` back even though the archive
is video-less.
## Resolution
root_cause: ""
fix: ""
verification: ""
files_changed: []
## Bisect Results (2026-05-16T12:15:00Z)
User chose "Bisect first" from the fix-strategy fork. Targeted git-log
scan across the suspect code regions across the full Phase 1 history
(`555eb05..HEAD`):
| Symptom | Mechanism | Introduction commit | Subsequent amplification |
|---------|----------------------------------------|------------------------------------------------------|-----------------------------------------------------------------|
| H1.a | ping setInterval with `?.postMessage` | `b064a21` (Plan 01-04, 2026-05-15) | none — race exists since this commit |
| H1.b | pre-emptive 290 s reconnect setTimeout | `b064a21` (Plan 01-04, 2026-05-15) | none |
| H2 | createArchive silent-skip on empty | **`555eb05` (imported broken upstream — predates Phase 1)** | `2e3f524` CR-01 "refuse-to-post on swapped port" without retry; `034155b` sweep #5 added diagnostic on timeout but still no retry |
**Strategic implications surfaced by bisect:**
1. **H2 silent-skip is an UPSTREAM defect** in the original "half-broken
blob of code" — never on the Phase 1 fix radar because:
- Original 22-defect audit didn't flag it (it's a "what happens if a
downstream returns 0?" question that needs a real failure mode to
surface)
- REVIEW.md's 18 findings didn't catch it
- Sweep #5 (commit 034155b "surface port-replaced-during-fetch
diagnostic on buffer timeout") added a log on timeout but kept the
silent-skip behaviour — the diagnostic logs went to the SW console,
but the operator-facing failure path stayed silent.
2. **H1 port race latency-amplifies H2 from latent to fatal.** Before
Plan 01-04 there was no port → no reconnect → silent-skip never fired
in practice. After 01-04 the 290 s pre-emptive reconnect creates a
guaranteed-recurring window where REQUEST_BUFFER round-trips can
fail; after CR-01 the "refuse to post on swapped port" pattern makes
the silent-skip fire with near-100 % probability whenever a save is
clicked across a reconnect window.
3. **Option A** (target both symptoms with retry + try/catch) is
sufficient to make the visible symptoms go away, but leaves the
upstream silent-skip pattern intact — a class of latent failures
remains. **Option B** (which converts silent-skip into operator-
visible error) directly retires the upstream class. **Option C**
(request-id'd round-trip protocol) makes port-replacement
irrelevant by architecture but is the largest blast radius.
The bisect strengthens the case for B (it's not just symptom-control;
it retires the entire shape of bug, including the upstream one).
However, A is still defensible if treated as a pure hotfix to ship
Phase 1 and the silent-skip fix is consciously deferred to a follow-up
plan. C is justifiable if the user wants Phase 1's port lifecycle to
match the durability of the D-12/D-13 contracts before Phase 2 layers
DOM capture on top.
(Updated by orchestrator after bisect; user decides next.)
## Fix Strategy Chosen: Option C (Architectural)
User decided (post-bisect): **Option C — Architectural port lifecycle refactor.**
Rationale: bisect shows the current port lifecycle's race window is the
proximate cause of every H1 error AND the trigger for the upstream H2
silent-skip class. Option C makes port-replacement architecturally
irrelevant by:
1. Replace pre-emptive 290 s `setTimeout` reconnect with **port-health
probe** (ping/pong protocol; offscreen explicitly waits for SW echo
before considering port live; on any failure → clean teardown +
reconnect).
2. Switch REQUEST_BUFFER / BUFFER to **request-id'd** pattern: SW
sends `{type:'REQUEST_BUFFER', requestId: uuid}`; offscreen
responds with `{type:'BUFFER', requestId, segments}`; SW only
accepts responses matching its current pending requestId. SW
retries on port replacement until success or hard outer timeout.
3. Continuous end-to-end vitest covering 600 s of port lifecycle
(2 reconnects + simulated REQUEST_BUFFER round-trips). Becomes
the new pinning contract for the port lifecycle.
4. **Keep** the existing operator-visible safety from Option B: even
in the new design, if a hard outer timeout fires, `createArchive`
must throw → popup surfaces `RECORDING_ERROR` to operator instead
of silently shipping a video-less archive.
Touches D-17 port-lifecycle contract (CONTEXT.md will be amended).
Expected surface: ~150-200 LOC + 2-3 new test files. Pre-existing
40 GREEN tests + 3 new RED tests must all flip GREEN at completion.
Constraints (re-iterated for the executor):
- TDD: each new behaviour gets a RED test BEFORE GREEN implementation.
- D-12 (base64 wire format) and D-13 (restart-segments) contracts must
not regress. Existing pinning tests at
`tests/offscreen/port-serialization.test.ts` and
`tests/offscreen/segment-rotation.test.ts` are the canaries.
- `npx tsc --noEmit` exit 0, `npm run build` exit 0, type-safety
grep clean must all hold at every commit.
- `tests/fixtures/last_30sec.webm` is currently dirty in the working
tree (operator's recent smoke session staged a new file at that
path). Do NOT commit changes to that file — the committed D-13
fixture is the canonical contract.
(Strategy decided 2026-05-16T12:18:00Z; executor dispatch follows.)