Files
mokosh/.planning/phases/01-stabilize-video-pipeline/01-REVIEW.md

24 KiB
Raw Blame History

phase, reviewed, depth, files_reviewed, files_reviewed_list, findings, status
phase reviewed depth files_reviewed files_reviewed_list findings status
01-stabilize-video-pipeline 2026-05-15T22:30:00Z deep 19
manifest.json
package.json
smoke.sh
src/background/index.ts
src/offscreen/index.html
src/offscreen/recorder.ts
src/shared/binary.ts
src/shared/logger.ts
src/shared/types.ts
tests/offscreen/codec-check.test.ts
tests/offscreen/handshake.test.ts
tests/offscreen/port-serialization.test.ts
tests/offscreen/port.test.ts
tests/offscreen/ring-buffer.test.ts
tests/offscreen/segment-keyframes.test.ts
tests/offscreen/segment-rotation.test.ts
tests/offscreen/webm-playback.test.ts
vite.config.ts
vitest.config.ts
critical warning info total
3 9 6 18
issues_found

Phase 1: Code Review Report

Reviewed: 2026-05-15T22:30:00Z Depth: deep Files Reviewed: 19 Status: issues_found

Summary

Phase 1 delivers the video pipeline stabilization: D-12 (base64 wire format) and D-13 (restart-segments lifecycle) both landed cleanly. The TDD discipline is visible in the test artifacts (port-serialization, segment-keyframes, webm-playback have explicit RED-then-GREEN structure). Type safety in the Phase 1 src/ files is clean — zero as any and zero @ts-ignore in src/background/index.ts, src/offscreen/recorder.ts, src/shared/*.ts. Storage constraint CON-buffer-storage is intact: only indexedDB.deleteDatabase('VideoRecorderDB') appears (idempotent cleanup), no writes to chrome.storage/IDB/localStorage from the new pipeline.

However, deep cross-file analysis surfaces three Critical correctness defects:

  1. A port reconnect race in src/offscreen/recorder.ts that drops in-flight REQUEST_BUFFER responses to a stale port reference (data-loss path).
  2. A port-keepalive ineffectiveness bug on the SW side — the SW never registers an onMessage listener on the port, so Chrome may not credit the PING traffic toward idle-timer reset under all conditions, and the BUFFER response path coexists with a SW-registered listener that conflicts.
  3. A handshake race window that re-opens after a Service Worker respawn: the OFFSCREEN_READY promise is resolved at most once, but the SW may respawn while the offscreen document persists — the second SW lifetime can no longer await readiness and will deadlock on offscreenReady.

Plus 9 Warnings (lifecycle correctness gaps, smoke-script bash hygiene, codec-check side effect, base64 encode/decode performance and edge cases) and 6 Info-level items (style divergence, pre-existing P5-deferred audit items, dead-code branches).

Critical Issues

CR-01: Port reconnect race drops BUFFER response under reconnect-during-encode

File: src/offscreen/recorder.ts:335-388 (combined with src/background/index.ts:111-154) Issue: encodeAndSendBuffer() reads keepalivePort after an await Promise.all(...). If the port disconnects during the ~tens-of-ms base64 encode of three 10s segments, the onDisconnect handler nulls keepalivePort AND synchronously calls connectPort(), which reassigns keepalivePort to a fresh port instance. The post-await check if (keepalivePort === null) { ... drop ... } therefore evaluates false on the reconnected case, and the BUFFER response is posted on the new port — but the SW-side getVideoBufferFromOffscreen is still listening on the old port (captured in const port = videoPort, line 110). Result: silent timeout (BUFFER_FETCH_TIMEOUT_MS = 2_000), SAVE_ARCHIVE produces a zip with segments: []. This is a data-loss path that masquerades as a benign timeout in logs.

Concretely the race is: SW sends REQUEST_BUFFER → offscreen begins encode → port hits its 290 s pre-emptive disconnect → offscreen receives BUFFER on new port → SW timed out and gave up on old port → ZIP has no video.

Fix: Capture the port reference before the await and post on that captured handle (or refuse to post if it's no longer the live keepalivePort):

async function encodeAndSendBuffer(): Promise<void> {
  const portAtRequest = keepalivePort;       // capture identity
  if (portAtRequest === null) return;
  // ... build allSegments / baseTimestamp ...
  const encodeResults = await Promise.all(/* ... */);
  // After await: refuse to post on a new port that the SW isn't listening on.
  if (keepalivePort !== portAtRequest) {
    logger.warn('port replaced during encode; dropping BUFFER response');
    return;
  }
  portAtRequest.postMessage({ type: 'BUFFER', segments: transferred });
}

The SW side already times out cleanly (2 s budget); the correct fix is to let the SW time out so the next save attempt re-issues REQUEST_BUFFER on the new port. Don't post stale data onto a stranger port.


CR-02: SW never registers onMessage on the incoming port — BUFFER response can race a torn-down listener and PING is silently discarded

File: src/background/index.ts:78-96 Issue: The SW-side chrome.runtime.onConnect handler captures videoPort = port and attaches only onDisconnect. The port.onMessage.addListener(handler) for BUFFER is added per-request inside getVideoBufferFromOffscreen() (line 152), then removed on timeout/response. Two consequences:

  1. PING traffic is discarded: When the per-request handler is not installed (which is the normal idle state of the port), PINGs from offscreen have no listener. Per the Chrome 110+ keepalive guarantee, the SW idle-timer reset triggers on inbound port traffic regardless of whether a listener is attached — but in practice "no listener" combined with the message-not-being-consumed has caused subtle behaviour differences in the field. The comment at line 95 ("Inbound traffic is mostly PING (ignored) ...") acknowledges this but doesn't actually attach a sink. Either install a no-op onMessage listener at port-acceptance time (preferred), or document explicitly that PING relies on the kernel-level port-message side effect.
  2. Race window on BUFFER: If the offscreen recorder happens to send a BUFFER for ANY reason (e.g., a stale REQUEST_BUFFER from a previous SW lifetime that's now being processed) before getVideoBufferFromOffscreen registers its per-call handler, the message is dropped. The current architecture treats this as fine — but it means the SW depends on strict ping-pong ordering with no transport-level guarantee.

Fix: Add a permanent port-message router at acceptance time and dispatch BUFFER to subscribers:

chrome.runtime.onConnect.addListener((port) => {
  if (port.name !== 'video-keepalive') return;
  if (port.sender?.id !== chrome.runtime.id) {
    logger.warn('Rejecting port', port.sender?.id);
    port.disconnect();
    return;
  }
  videoPort = port;
  port.onMessage.addListener((msg) => {
    if (typeof msg === 'object' && msg !== null
        && (msg as {type?: unknown}).type === 'PING') {
      return;  // explicit sink — keeps SW alive even with no pending fetch
    }
    // BUFFER routing handled by per-request listener; unknown traffic dropped.
  });
  port.onDisconnect.addListener(() => { videoPort = null; });
});

CR-03: Handshake deadlock after Service Worker respawn

File: src/background/index.ts:32-35, 158-202, 487-491 Issue: offscreenReady is a one-shot promise resolved on the first OFFSCREEN_READY message (line 489 nulls the resolver). If the Service Worker is evicted (Chrome MV3 ~30 s idle) while the offscreen document is still alive, the next SW lifetime will:

  1. Create a fresh offscreenReady Promise at module load (line 33).
  2. Detect the existing offscreen via chrome.offscreen.hasDocument() (line 516) — offscreenCreated = true.
  3. The next REQUEST_PERMISSIONS triggers startVideoCapture() which awaits offscreenReady on line 180.
  4. But the offscreen document has already sent its OFFSCREEN_READY exactly once (line 452 in recorder.ts) at its original bootstrap — it does NOT re-emit on SW respawn.
  5. offscreenReady therefore hangs forever, and startVideoCapture() never returns, until the user closes Chrome.

This is masked in dev because each fresh dev-load tears down everything together. The audit P1 #8 mitigation at lines 514-524 handles the offscreen-exists case for ensureOffscreen() but not the readiness handshake.

Fix: Either (a) have offscreen re-emit OFFSCREEN_READY periodically OR on inbound SW message, OR (b) on SW init, if chrome.offscreen.hasDocument() returns true, immediately resolve offscreenReady (the offscreen has already been ready in a prior life):

async function initialize() {
  // ...
  if (typeof chrome.offscreen?.hasDocument === 'function') {
    const exists = await chrome.offscreen.hasDocument();
    if (exists) {
      offscreenCreated = true;
      // Offscreen already bootstrapped in a prior SW lifetime — don't wait
      // for an OFFSCREEN_READY that will never come.
      offscreenReadyResolve?.();
      offscreenReadyResolve = null;
      logger.log('Existing offscreen — resolving readiness immediately');
    }
  }
}

Option (b) is simpler and sufficient. Add a regression test that imports src/background/index.ts with a chrome.offscreen.hasDocument stub returning true and asserts that offscreenReady resolves without needing an OFFSCREEN_READY message.

Warnings

WR-01: getDisplayMedia rejection from user-cancel surfaces as RECORDING_ERROR with raw DOMException message

File: src/offscreen/recorder.ts:146-151 Issue: When the operator dismisses the picker, Chrome rejects getDisplayMedia with NotAllowedError. The catch block at line 146 forwards the raw error.message (e.g., "Permission denied by user") into a runtime broadcast. Per CLAUDE.md (Python guidelines also apply philosophically: "Sanitize error messages: log details server-side, return generic messages to clients"), the runtime message should be a stable, programmatically actionable code; raw browser error strings change between Chrome versions and locales.

Fix: Classify the failure mode ('user-cancelled', 'codec-unsupported', 'permission-denied', 'unknown') and emit a code; log the raw string at warn level only:

} catch (error) {
  const code = classifyDisplayMediaError(error);  // → 'user-cancelled' | ...
  logger.error('startRecording failed:', error);
  chrome.runtime.sendMessage({ type: 'RECORDING_ERROR', error: code });
  throw error;
}

WR-02: assertCodecSupported() has a side-effect on a side-effect (chrome.runtime.sendMessage from a pure-named function)

File: src/offscreen/recorder.ts:102-113 Issue: A function named assertCodecSupported reads as a pure predicate that throws. It actually broadcasts a RECORDING_ERROR over chrome.runtime.sendMessage before throwing. This couples the assertion to extension runtime, breaks single responsibility, and forces the test (codec-check.test.ts) to stub chrome.runtime.sendMessage even when only the assertion logic is under test. Also, startRecording() then re-emits RECORDING_ERROR for the same failure in its catch block (line 149), so the popup receives two RECORDING_ERROR messages for one underlying problem.

Fix: Either rename to assertCodecSupportedAndNotify, or remove the sendMessage from the assertion and rely on the caller's catch block. Recommended: extract notification to the caller.

export function assertCodecSupported(): void {
  if (!isCodecSupported()) {
    throw new Error(`vp9 unsupported. UA=${navigator?.userAgent ?? '<unknown>'}`);
  }
}
// startRecording handles the notify-on-throw via its existing catch.

This also fixes the double-emit. Update codec-check.test.ts GREEN block accordingly.


WR-03: mergeVideoSegments sort uses non-stable timestamp resolution; equal timestamps may reorder

File: src/background/index.ts:254 + src/offscreen/recorder.ts:352, 366 Issue: The offscreen recorder uses baseTimestamp + idx (Date.now() + idx) to tag segments — millisecond resolution + index offset. Two consecutive REQUEST_BUFFER calls within the same millisecond would both produce timestamps starting at the same Date.now(). While the merging code now operates on a single REQUEST_BUFFER batch (so collisions don't happen in practice), the contract is fragile: a future change that calls encodeAndSendBuffer twice (e.g., diagnostic prefetch + actual save) would produce duplicate timestamps. Array.prototype.sort is stable per ES2019 spec, so the in-batch order is preserved, but cross-batch order is not guaranteed by anything visible. Document this or move to a strictly monotonic counter.

Fix: Replace Date.now() + idx with a module-level monotonic counter:

let segmentSeq = 0;
// ...
return { data, type: segment.type || 'video/webm', timestamp: ++segmentSeq };

WR-04: Smoke script — set -euo pipefail interacts badly with read -r -d '' and grep ... || true patterns

File: smoke.sh:92, 109, 161, 197 Issue: Several locations rely on || true to swallow non-zero exits, which is correct under Google's shell style but fragile when piping. Specifically:

  • Line 109: python3 -c '...' 2>/dev/null || printf '%s' "${SMOKE_HTML}" — the fallback path emits unencoded HTML into a data: URL. If python3 is missing, the URL will contain literal <, >, spaces, and quotes — Chrome may fail to parse the URL silently, and the operator will see a blank tab.
  • Line 161: find ... | sort -rn | head -1 | cut -d' ' -f2- || true|| true on the last segment of a pipe under pipefail masks errors mid-pipe.
  • Line 197: ffprobe ... && GATE=0 || GATE=$? — this is the classic anti-pattern: GATE=0 is itself a command that returns 0, so || GATE=$? is unreachable. GATE is always 0 here, even on ffprobe failure. This means the acceptance gate is broken — but only structurally: the script proceeds to if [[ ${GATE} -eq 0 ]] and prints "PASSED" regardless of ffprobe.

Wait — let me re-read. ffprobe ... && GATE=0 || GATE=$? — if ffprobe exits 0, then && runs GATE=0 which returns 0, so || short-circuits. If ffprobe exits non-zero, && is skipped, so || runs GATE=$?. But $? here is the exit code of the && chain, which is ffprobe's exit. Actually correct because the && short-circuits. False alarm on this one.

The python3 fallback is a real defect though.

Fix:

  1. Require python3 explicitly in pre-flight:
command -v python3 >/dev/null || { red "FAIL: python3 not installed (needed for URL encoding)"; exit 1; }

Then remove the || printf '%s' "${SMOKE_HTML}" fallback. 2. Consider using pipefail semantics consistently; the || true at line 161 should be inside the subshell, not after the pipe.


WR-05: Smoke script writes operator's ~/Downloads directory and globs all session_report_*.zip — risk of false-positive detection

File: smoke.sh:40, 74, 158, 161, 174-180 Issue: The script polls ~/Downloads for any file matching session_report_*.zip. If the operator was running the extension prior to the smoke test, or is running it in another browser window, an unrelated session_report could trigger the polling success and the script will ffprobe the wrong file. The BEFORE_LIST count (line 74) mitigates this for count but not for identity — if one pre-existing zip and one new zip both exist, LATEST (line 161) picks by mtime, which is correct, but the count comparison NEW_COUNT > BEFORE_LIST is satisfied by any operation that adds a matching zip.

Fix: Snapshot the full list before, diff after:

BEFORE_ZIPS=$(find "${DOWNLOADS_DIR}" -maxdepth 1 -name 'session_report_*.zip' -printf '%p\n' | sort)
# ... later ...
AFTER_ZIPS=$(find ...)
NEW_ZIP=$(comm -13 <(printf '%s\n' "${BEFORE_ZIPS}") <(printf '%s\n' "${AFTER_ZIPS}") | head -1)

Additionally consider redirecting the smoke download to /tmp/mokosh-smoke-downloads via chrome.downloads API or a custom --download-directory flag — but that requires Chrome side changes.


WR-06: blobToBase64 ⇒ O(n²) string concatenation for large segments

File: src/shared/binary.ts:48-52 Issue: Per-byte binary += String.fromCharCode(bytes[i]) for a ~1.5 MB segment produces ~1.5M string concatenations. V8 optimizes ConsString chains, but the worst case is allocation-heavy and pauses the event loop visibly. The phase context notes "~1.5 MB segments" — that's ~1.5M iterations per Promise.all call (3 segments × ~1.5MB = ~4.5M iterations on each REQUEST_BUFFER). The 2-second SW timeout (BUFFER_FETCH_TIMEOUT_MS) leaves headroom in benchmarks today, but this is one of the dimensions that grows with future quality bumps (e.g., raising VIDEO_BITRATE or SEGMENT_DURATION_MS).

This is out of v1 review scope (performance) per <review_scope>, BUT it ALSO has a correctness edge: String.fromCharCode(...subarray) via .apply() blows the argument-length limit (~64 KiB) on some engines. The current per-byte loop avoids that. So the comment at lines 45-47 is correct — the trade-off is intentional. Documenting this is fine; no behavioural fix required for v1. Flagged for awareness if Phase 2 quality model includes higher bitrate.

Fix: None needed for v1. For Phase 2, consider chunked apply with 32 KiB stride:

const CHUNK = 0x8000;
const parts: string[] = [];
for (let i = 0; i < bytes.length; i += CHUNK) {
  parts.push(String.fromCharCode.apply(null, bytes.subarray(i, i + CHUNK) as any));
}
return btoa(parts.join(''));

WR-07: base64ToBlob does not validate input — malformed base64 throws cryptic InvalidCharacterError

File: src/shared/binary.ts:67-74 Issue: If a malformed base64 string slips through (e.g., a port message corrupted in transit, though TLS-equivalent inside Chrome makes this near-impossible), atob(b64) throws InvalidCharacterError from the SW caller. The caller (src/background/index.ts:141-147) does have a try/catch, so a segment is dropped on failure — fine. But the b64 parameter has no input validation: empty string atob('') === '' returns an empty Blob silently. Empty Blob has size 0, but currently no caller filters zero-size segments out of the WebM concat result, which would produce a slightly malformed multi-EBML file.

Fix: Add a defensive shortcut for empty input:

export function base64ToBlob(b64: string, mimeType: string): Blob {
  if (b64.length === 0) {
    return new Blob([], { type: mimeType });  // explicit, not via atob('')
  }
  // ...
}

And in getVideoBufferFromOffscreen filter out empty segments before pushing to mergeVideoSegments.


WR-08: downloadArchive builds a Data URL with btoa(per-byte binary string) — re-implements the same pattern with the same potential issue

File: src/background/index.ts:350-358 Issue: This duplicates the body of blobToBase64 inline instead of importing it. Same O(n²) concern as WR-06, applied to the entire zip (which is the SUM of all segments, ~5-10 MB after JSZip compression on three ~1.5MB segments). At 10 MB this is on the edge — ~10M iterations of string concatenation. More importantly: code duplication. The src/shared/binary.ts helper exists; this caller should use it.

Fix:

import { blobToBase64 } from '../shared/binary';
// ...
const base64 = await blobToBase64(archiveBlob);
const url = `data:application/zip;base64,${base64}`;

WR-09: mergeVideoSegments accepts VideoSegment[] but offscreen-side encodeAndSendBuffer may inject an unfinalized in-flight segment

File: src/offscreen/recorder.ts:340-348 and src/background/index.ts:248-274 Issue: encodeAndSendBuffer includes the currently-recording chunks as an additional segment (lines 342-347). This in-flight Blob lacks the WebM-finalization that MediaRecorder.stop() performs (Matroska SegmentSize, Cues). The phase context (debug session webm-playback-freeze) explicitly notes that this finalization is what fixes the "File ended prematurely" ffmpeg error. So including an unfinalized in-flight segment could re-introduce the prematurely-ended-file symptom for the LAST segment in the merged file.

The trade-off comment at line 340 acknowledges this exists to avoid an empty-buffer UX issue if SAVE_ARCHIVE fires within 10 s of START_RECORDING. But the contract is: Chrome's multi-EBML-header concat works because each segment is self-contained. An unfinalized segment is not self-contained.

Fix (options): (a) Exclude the in-flight segment unless segments.length === 0 (preserve UX for first-10-s case, otherwise rely on finalized data only). (b) Force a synchronous rotation at REQUEST_BUFFER time: call rotateSegment(), await its onstop, then encode getSegments(). The added latency is bounded.

Option (a) recommended for least change:

const allSegments: Blob[] =
  finalized.length === 0 && inFlight !== null
    ? [inFlight]
    : finalized;

Info

IN-01: extensionVersion: '1.0.0' hardcoded in metadata — pre-existing P5-deferred audit item

File: src/background/index.ts:324 + manifest.json:4 Issue: Per Phase 1 context note (9), this is a known pre-existing audit item deferred to Phase 5. Surfaced here for tracking continuity. The hardcoded '1.0.0' will silently lie about the running extension version once manifest.json bumps. Currently both are 1.0.0 so the bug is dormant. Fix (Phase 5): extensionVersion: chrome.runtime.getManifest().version


IN-02: Logger and ContentLogger use ...args: any[] while OffscreenLogger uses ...args: unknown[]

File: src/shared/logger.ts:9, 14-22, 35, 40-50, 63, 68-78 Issue: Style divergence is explicitly called out in the comment at lines 54-55 — referred to as plan 01-03's style_divergence_note. Not actionable for Phase 1; flag for Phase 5 cleanup. The risk: a wholesale ts-strict tightening pass will flag Logger / ContentLogger and not OffscreenLogger, increasing review noise.

Fix (Phase 5): Migrate Logger and ContentLogger to ...args: unknown[] once the consumers' call sites are audited for type safety.


IN-03: tests/offscreen/ring-buffer.test.ts is a vestigial 30-line breadcrumb — consider deleting

File: tests/offscreen/ring-buffer.test.ts:1-29 Issue: The file is intentionally kept as a "breadcrumb" pointing readers to segment-rotation.test.ts. This is fine for one phase but accumulates archaeology over time. A git mv rename in the same commit (with git's rename detection) would preserve the breadcrumb in git log --follow without leaving a 30-line tombstone in the working tree.

Fix (optional): Delete the file; rely on git log -- tests/offscreen/segment-rotation.test.ts or a MIGRATIONS.md doc to surface the rename.


IN-04: tests/offscreen/webm-playback.test.ts defines decodeDryRun but never calls it; uses void decodeDryRun to silence noUnusedLocals

File: tests/offscreen/webm-playback.test.ts:65-105, 172-178 Issue: decodeDryRun (using execFileSync) is kept "for documentation value" but never executed. This is dead code in production-test files. The void decodeDryRun pattern at line 178 is a tsc-appeasement hack. The comment block at lines 172-176 articulates the intent, but the cleaner answer is to extract the comparison into a JSDoc block, delete the unused function, and keep only decodeDryRunStrict.

Fix: Delete decodeDryRun and the void suppression.


IN-05: Message<T = any> interface uses any type parameter default

File: src/shared/types.ts:18-22 Issue: export interface Message<T = any> { type: MessageType; data?: T; tabId?: number; } — the any default leaks through everywhere a Message is referenced without an explicit type argument (most call sites do). Recommend unknown default with mandatory narrowing at callers. Fix (Phase 5): Message<T = unknown>. Audit downstream (msg as Message).data accesses for explicit narrowing.


IN-06: Manifest permissions — activeTab + <all_urls> host_permissions is broader than strictly required for getDisplayMedia flow

File: manifest.json:6-15 Issue: After the D-01 amendment (chrome.tabCapture → getDisplayMedia), the activeTab permission is no longer used by the video path. It remains in use by chrome.tabs.captureVisibleTab (line 228 of background/index.ts) for the screenshot. That's legitimate. However host_permissions: ["<all_urls>"] is required only by the content script (rrweb injection). This is documented sufficient-but-broad in CONTEXT.md.

No defect — flagged for principle-of-least-privilege awareness for any future scoping work.


Reviewed: 2026-05-15T22:30:00Z Reviewer: Claude (gsd-code-reviewer) Depth: deep