Feature Specifications

Detailed specifications for every entry in the SideCar release plan. Each entry describes the problem, the mechanism, integration points, and the configuration surface. Organized thematically.

Note: This is a thematic spec archive, not a live backlog — many items below have since shipped (e.g. Fork & Parallel Solve, Typed Sub-Agent Facets, Skills 2.0, Skill Sync & Registry, the Project Knowledge Index, and the Merkle layer). The CHANGELOG is the source of truth for what’s released (the ROADMAP is forward-looking only); treat a spec here as design intent that may already be live.

Context & Intelligence

SIDECAR.md Retrieval-Mode — Semantic Section Scoring in the Fusion Pipeline — the retrieval-based successor to path-scoped injection above. Once the parseSidecarMd primitive (v0.67) has landed, this entry layers a SidecarMdRetriever on top that joins the existing fuseRetrievers pipeline (DocRetriever / MemoryRetriever / SemanticRetriever). Why retrieval is better for some workspaces: path-scoped routing assumes users know which sections apply to which paths, and they annotate accordingly. Large projects with organically-grown SIDECAR.md files (50+ sections, inconsistent heading naming, overlap between section scopes) often don’t — and asking the model “which sections of this doc are relevant to the question I’m asking right now?” is exactly the problem retrieval is good at. Mechanism: on workspace init, every section body is embedded with the same all-MiniLM-L6-v2 model used elsewhere in the retrieval stack and stored in a namespaced LanceDB table (or the flat fallback) at .sidecar/cache/sidecarMd/. On each turn, the retriever scores sections against the fused query (user message + active file path + recent tool_result summaries) via cosine similarity, applies RRF against the other retrievers, and surfaces the top-K as [SIDECAR.md · §<heading>]-tagged hits in the fused context block. Incremental updates: same pattern as Project Knowledge Index — fs.watch on SIDECAR.md triggers re-parse + per-section hash compare; only changed sections re-embed. Saves survive across sessions. Hybrid with path routing: sections and retrieval compose naturally. When both are enabled, the path-scoped always sections are always included verbatim (cost = no tokens consumed by retrieval scoring for always-sections), and the retriever scores only the scoped + low pool against the current query to pick top-K. That preserves the deterministic “Build” / “Conventions” inclusion while letting retrieval surface the right “Transforms” section without relying on a path glob the author never wrote. 
Faithfulness audit via the existing RAG-eval harness: a new golden-case fixture at src/test/retrieval-eval/sidecarMdGolden.ts asserts that for a query like “how do I add a new transform kernel?” the retrieved section is the one tagged ## Transforms and NOT the ## Database section that happens to share the word “index.” Failures become CI regressions the same way other retrieval quality regressions surface. Composes with Dense-Repository Context Mode: when a domain profile is active (e.g. physics, signal-processing), the profile’s preserveRegex patterns boost sections containing matching text — the ## Invariants section containing epsilon_0 = 8.854e-12 always scores on physics profile turns, even if the user’s immediate query doesn’t say “epsilon.” UI: a SIDECAR.md index health line in the existing observability surface showing indexed sections + disk footprint + last-update-time; a /sidecarmd preview <query> slash command that runs a dry retrieval against an arbitrary query so users can debug why a section isn’t surfacing. Configured via sidecar.sidecarMd.mode: 'retrieval' (opt-in), sidecar.sidecarMd.retrieval.topK (default 5, clamped 1–20), sidecar.sidecarMd.retrieval.minScore (default 0.3 — sections below this threshold never surface even if they’re in top-K, prevents forced-include on doc-light projects), and sidecar.sidecarMd.retrieval.alwaysIncludeHeadings (shared with section-mode — these bypass retrieval scoring and inject verbatim). Roadmap slot: v0.70+ as a retrieval-infrastructure beat, after the path-scoped primitive has been in production long enough to measure the gap retrieval needs to close.
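The per-turn selection described in this entry can be sketched as follows. This is a minimal illustration, not the shipped retriever: the `Section` and `ScoredSection` shapes and the `selectSections` name are assumptions made for the example, and the RRF fusion step with the other retrievers is deliberately left out. Only `topK`, `minScore`, and `alwaysIncludeHeadings` come from the configuration surface above.

```typescript
// Hypothetical shapes: the spec names the settings but not these types.
interface Section {
  heading: string;
  embedding: number[];
}

interface ScoredSection {
  heading: string;
  score: number | null; // null for always-included sections, which bypass scoring
  pinned: boolean;
}

// Cosine similarity between two vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function selectSections(
  query: number[],
  sections: Section[],
  opts: { topK: number; minScore: number; alwaysIncludeHeadings: string[] },
): ScoredSection[] {
  // topK is clamped to 1-20, as in the configuration description.
  const topK = Math.min(20, Math.max(1, Math.floor(opts.topK)));
  const always = new Set(opts.alwaysIncludeHeadings);

  // Always-included headings are injected verbatim and never scored.
  const pinned: ScoredSection[] = sections
    .filter((s) => always.has(s.heading))
    .map((s) => ({ heading: s.heading, score: null, pinned: true }));

  // Everything else competes on similarity; minScore applies before topK.
  const scored: ScoredSection[] = sections
    .filter((s) => !always.has(s.heading))
    .map((s) => ({ heading: s.heading, score: cosine(query, s.embedding), pinned: false }))
    .filter((s) => (s.score as number) >= opts.minScore)
    .sort((x, y) => (y.score as number) - (x.score as number))
    .slice(0, topK);

  return pinned.concat(scored);
}
```

Applying `minScore` before the top-K cut is what keeps a doc-light project from being forced to surface weak matches just to fill K slots.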
Scheduled Task Concurrency Safety — Shadow Routing + DocumentConcurrencyGate + Deferral Queue — closes the real failure mode where a sidecar.scheduledTasks entry fires mid-keystroke: a 2 AM lint/refactor task fires at noon while the developer is actively editing a target file, the agent writes through fs.ts tools directly against the main tree, and a WorkspaceEdit lands on a file with an unsaved buffer and an active cursor. The result is catastrophic UI stutter, cursor jump, lost edits if the user saves over the agent’s change — exactly the class of bug that destroys developer trust in background automation. Current state: scheduler.ts:46-70 runs tasks with approvalMode: 'autonomous', no Shadow Workspace routing, no dirty-buffer check anywhere in the path. The agent happily writes to src/foo.ts while the user is typing in src/foo.ts. This entry fixes it in three layers. (1) Force Shadow Workspace on every scheduled run: scheduler wraps the agent invocation with { forceShadow: true, deferPrompt: true } — reuses the v0.66 primitive Facets already use. Writes land in .sidecar/shadows/scheduled-<task-id>/ during the run; main tree is untouched for the entire 30-minute lint pass. The user’s active editor at <workspace>/src/foo.ts cannot possibly be affected by writes to <workspace>/.sidecar/shadows/<id>/src/foo.ts — different paths, different cwd, the TextDocument VS Code knows about doesn’t care. (2) DocumentConcurrencyGate at apply time: new primitive at src/agent/documentGate.ts exposes checkPathSafe(absolutePath): { safe: boolean; reason?: 'dirty' | 'active' | 'ok' }. Called once per touched path before the shadow’s diff is applied to main. Looks up the path in workspace.textDocuments.find(d => d.uri.fsPath === absolutePath) — if the document is open AND isDirty, returns { safe: false, reason: 'dirty' }. Separately checks window.activeTextEditor?.document.uri.fsPath === absolutePath — if the user is actively in that file, returns { safe: false, reason: 'active' }. 
Clean + inactive files apply immediately via the existing GitCLI.applyPatch path; deferred files queue. (3) Persistent deferral queue: .sidecar/scheduled/pending.jsonl (gitignored, append-only) stores { taskId, taskName, targetPath, pendingDiff, queuedAt, reason } per deferred entry. The queue survives extension reloads and VS Code restarts. On activation, the queue is replayed: for each entry, re-check checkPathSafe and apply if now safe. An onDidSaveTextDocument + onDidChangeActiveTextEditor listener fires the same re-check whenever an affected path becomes available — saves drain the queue silently without any foreground interrupt. Cross-task mutex per path: if task B completes with a pending diff for src/foo.ts while task A’s diff for the same file is still queued, the queue serializes them in queuedAt order — the new src/agent/lockPrimitives.ts module (extracted from the existing fileLock.ts FIFO pattern) gates the apply so task B applies onto the post-task-A result rather than racing. If task A’s apply fails (patch conflict with a user edit that landed in the interim), task B still gets its own attempt against the post-user-edit file; failures surface individually. Staleness surface: if a queued entry has been pending longer than sidecar.scheduledTasks.staleWarningMinutes (default 60, set to 0 to disable), a single notification surfaces (“2 scheduled task results are waiting on src/foo.ts. Review now?”) that opens the existing facet-review UI from v0.66 — it’s already diff-aware, offers per-file accept/reject, integrates with vscode.diff, and requires no new UI code. Task-level versus path-level deferral: if ANY target in the task’s diff is unsafe, the task’s diff is deferred whole — we don’t partial-apply a multi-file refactor because half the files were clean. This preserves atomicity; all-or-nothing matches how Audit Mode and Facets already reason about batches. 
Composes with Audit Mode: when Audit Mode is active, scheduled-task applies route through the audit buffer the same way agent writes do — the user sees “3 scheduled changes awaiting review” in the existing Audit tree, accept/reject applies or discards atomically. Composes with Regression Guards: pre-completion guards run against the shadow before the apply gate, so a guard that detects a broken invariant can block the scheduled task’s apply even when the path is clean and inactive — no “silent 2 AM commit that breaks the build.” Composes with Shadow Sweep: the existing v0.62.1 shadow-sweep already cleans abandoned .sidecar/shadows/ entries from crashed sessions; scheduled-task shadow IDs use the same scheduled-<task-name>-<timestamp> namespace so sweep handles them uniformly. Configured via sidecar.scheduledTasks.forceShadow (default true; escape hatch false restores pre-v0.69 direct-write behavior with a one-time warning), sidecar.scheduledTasks.gateOnDirty (default true), sidecar.scheduledTasks.gateOnActive (default true), sidecar.scheduledTasks.staleWarningMinutes (default 60), and sidecar.scheduledTasks.maxQueuedBatches (default 50; hard cap against runaway queue if a path stays dirty for days — oldest batches drop silently with an audit-log entry). Explicitly out of scope: per-character edit merging (that’s git’s job via applyPatch conflict detection), real-time collaborative editing (different problem class), cross-user scheduled-task coordination in team environments (different problem class), preemptive abort of a scheduled-task run when the user opens a target file mid-run (the shadow isolates writes already, preemption adds complexity for marginal benefit — let the run finish in its shadow, gate the apply).
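The apply-time gate and the all-or-nothing deferral rule can be sketched as a pure function over a snapshot of editor state. This is an illustration under stated assumptions: `EditorSnapshot` and `gateTask` are hypothetical names standing in for the lookups the entry describes against workspace.textDocuments and window.activeTextEditor, so the sketch runs without the VS Code API. Only `checkPathSafe`, its return shape, and the `gateOnDirty` / `gateOnActive` settings come from the text above.

```typescript
type GateReason = "dirty" | "active" | "ok";

interface GateResult {
  safe: boolean;
  reason?: GateReason;
}

// Hypothetical stand-in for the editor state the real gate would read
// from workspace.textDocuments and window.activeTextEditor.
interface EditorSnapshot {
  openDocuments: { fsPath: string; isDirty: boolean }[];
  activePath?: string;
}

interface GateOptions {
  gateOnDirty: boolean;
  gateOnActive: boolean;
}

function checkPathSafe(
  absolutePath: string,
  editor: EditorSnapshot,
  opts: GateOptions = { gateOnDirty: true, gateOnActive: true },
): GateResult {
  const doc = editor.openDocuments.find((d) => d.fsPath === absolutePath);
  // An open document with unsaved changes blocks the apply.
  if (opts.gateOnDirty && doc !== undefined && doc.isDirty) {
    return { safe: false, reason: "dirty" };
  }
  // So does the file the user currently has focused.
  if (opts.gateOnActive && editor.activePath === absolutePath) {
    return { safe: false, reason: "active" };
  }
  return { safe: true, reason: "ok" };
}

// Task-level deferral: a single unsafe path defers the whole diff.
function gateTask(
  touchedPaths: string[],
  editor: EditorSnapshot,
  opts?: GateOptions,
): { apply: boolean; blocked: { path: string; reason: GateReason }[] } {
  const blocked: { path: string; reason: GateReason }[] = [];
  for (const path of touchedPaths) {
    const result = checkPathSafe(path, editor, opts);
    if (!result.safe && result.reason !== undefined) {
      blocked.push({ path, reason: result.reason });
    }
  }
  return { apply: blocked.length === 0, blocked };
}
```

Returning the full `blocked` list rather than stopping at the first unsafe path gives the deferral queue a reason to record for every entry it stores.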
Multi-repo cross-talk — impact analysis across dependent repositories via cross-repo symbol registry
Semantic Agentic Search for Monorepos — cross-repository memory backed by a dedicated MCP server that indexes multiple local folders simultaneously into a unified vector store. The agent can answer questions like “does the algorithm in repo-a match the implementation in repo-b?” by running a semantic diff across both indices, surfacing divergences, stale copies, and interface mismatches in a single response. Each root is indexed independently so adding or removing a repo doesn’t invalidate the others. Configured via sidecar.monorepoRoots (array of absolute paths) and exposed as a search_repos tool the agent calls automatically when a prompt references multiple packages. A Repo Index status-bar item shows live indexing progress per root.
Dense-Repository Context Mode — Domain Profiles + Invariant-Aware Retention — closes the gap that remains after graph-expanded retrieval ships in v0.65: for deeply-interconnected codebases like electromagnetics simulations, signal-processing engines, and extensive transform libraries, the agent needs not just “pull in the callers” but “keep the load-bearing constants, equations, and physical units from being evicted when the turn gets long.” Today’s compression layer (src/agent/loop/compression.ts) prunes tool_results and old turns by character count — zero awareness of whether a truncated line contained epsilon_0 used in twelve other files, the Maxwell-equation block that the next three write_file calls must stay consistent with, or the sample-rate constant that propagates through every DSP function. This entry introduces structural awareness to both retrieval and pruning. Domain profiles live as declarative markdown with frontmatter at .sidecar/profiles/<name>.md (opt-in, path configurable via sidecar.domainProfiles.registryPath, default .sidecar/profiles/); a profile declares retrieval policy (graphWalkDepth, prioritize globs for .m / .py / .tex / .cpp / .f90), invariant patterns to preserve under pruning (preserveRegex: ["\\b(epsilon|mu|c|h|k_B)_?0?\\b", "\\\\frac\\{[^}]+\\}", "const\\s+\\w+\\s*=\\s*[0-9]"]), kind priorities (physics.md boosts type, function, const; signal.md boosts function with names matching fft|dct|dwt|filter|transform), and token-budget hints (reservedForInvariants: 500 — a floor carved out of the retrieval budget so invariant lines always get a seat even when the rest of context is hot). Built-in profiles ship for physics, signal-processing, transforms-and-kernels, numerical-methods, and control-systems; users copy and customize under the same directory. 
Activated per-workspace via sidecar.domainProfiles.active (string array — profiles compose, e.g. ["signal-processing", "physics"] for an EM-simulation repo) or per-prompt via the @profile:physics sentinel. Symbol-level importance score layered onto the existing Project Knowledge Index: every symbol gets a precomputed importance value from (fanIn × 0.4) + (referenceCount × 0.3) + (matchesPreserveRegex × 0.3), persisted in the Merkle store next to the embedding. High-importance symbols are exempt from low-priority eviction. When compression needs to free N chars from a tool_result or code snippet, it reads importance scores for every line’s containing symbol and elides the lowest-scoring first — a tool_result containing epsilon_0 = 8.854e-12 stays; the surrounding debug print statements drop. Invariant-aware summarization extends ConversationSummarizer: when an old turn references a preserved-regex hit (say, the Maxwell-equation block), the summarizer replaces the surrounding prose but keeps the equation verbatim as a quoted block. The summary reads “In turn 3, we discussed the divergence of E; the form referenced was: ∇·E = ρ/ε₀.” Model sees the summary AND the exact invariant — no drift. Small-context adaptation is the scenario this was designed for: on a 4K local model where every token counts, domain profiles become more valuable, not less, because the profile’s reservedForInvariants floor converts “random character truncation” into “keep the physics, drop the narration.” The retriever + pruner consult profile config whenever contextLength < 16K and tighten the filtering accordingly. 
Reference graph surfacing surfaces cross-file numeric-constant coupling as first-class hits: a new find_shared_constants(symbol) agent tool walks the symbol graph plus a lightweight constant-use index (maintained by a tree-sitter visitor that flags const/static const/final/Parameter declarations), and returns every file that depends on a specific named value — so “before you change SAMPLE_RATE, here are the 12 files that use it” becomes a pre-edit check the agent runs automatically when edit_file targets a file matching preserveRegex. Cross-invariant validation at completion-gate time: a new post-turn hook extracts numeric literals and named constants from every write_file/edit_file the turn produced, cross-references them against the invariant set, and flags divergence (“MU_0 declared as 1.257e-6 in fields.py line 12 but 1.256e-6 in waves.py line 38 — which is correct?”). Guards against the class of physics/math bug where two “agreeing” files silently disagree on the fourth decimal. Composes with every earlier retrieval entry: Project Knowledge Index is where the importance scores live + the Merkle layer reuses them as extra metadata for subtree selection; Memory Guardrails becomes “pin the profile’s invariant set by default” rather than manually picking constants; Semantic Time Travel answers “when did epsilon_0 last change?” in O(diff) via Merkle; Multi-repo cross-talk checks for constant agreement ACROSS repos (same GRAVITY value in planetary_sim/ and orbit_mechanics/? the tool flags the drift). Profile discovery: /profile suggest analyzes the workspace (file extensions, import graph heuristics, presence of .tex/numpy/scipy/eigen), surfaces the top 1-3 matching built-in profiles with a one-click accept, writes the chosen profile(s) to sidecar.domainProfiles.active, and begins tracking. Output verbosity: the Retrieved Context block in the system prompt gains a “Preserved by domain profile” section tagged [profile: physics] so the model sees which lines are invariant-floor vs. 
standard retrieval hits. Configured via sidecar.domainProfiles.enabled (default false — opt-in per workspace; activating a profile auto-flips this), sidecar.domainProfiles.registryPath (default .sidecar/profiles/), sidecar.domainProfiles.active (string array — profiles compose in declared order, later profiles override earlier on conflict), sidecar.domainProfiles.autoDetect (default true — on first activation, run /profile suggest and prompt the user), sidecar.domainProfiles.reservedForInvariantsFloor (override floor applied to every active profile, default 0 = use each profile’s own value), sidecar.domainProfiles.crossInvariantValidation (default true), and sidecar.domainProfiles.sharedConstantsTool (default true). Pairs naturally with the v0.65-shipped graph-expanded retrieval (which this entry treats as the foundation) and the Project Knowledge Index’s Merkle + importance scoring.
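The importance formula and the lowest-score-first eviction rule from this entry can be sketched in a few lines. The weights are the ones stated above. Everything else is an assumption for illustration: the `SymbolStats` and `ScoredLine` shapes, the function names, and the choice to count one extra character per dropped line for its newline. The formula is applied to raw counts exactly as written, since the entry does not say whether fanIn and referenceCount are normalized.

```typescript
// Hypothetical input shape; the spec defines the formula, not the record.
interface SymbolStats {
  fanIn: number;
  referenceCount: number;
  matchesPreserveRegex: boolean;
}

// importance = (fanIn × 0.4) + (referenceCount × 0.3) + (matchesPreserveRegex × 0.3)
function importance(stats: SymbolStats): number {
  return (
    stats.fanIn * 0.4 +
    stats.referenceCount * 0.3 +
    (stats.matchesPreserveRegex ? 1 : 0) * 0.3
  );
}

// A line of a tool_result or snippet, tagged with the importance
// of the symbol that contains it.
interface ScoredLine {
  text: string;
  importance: number;
}

// Drop the lowest-importance lines until at least charsToFree characters
// are freed. Surviving lines keep their original order.
function elideLowImportance(lines: ScoredLine[], charsToFree: number): string[] {
  const byImportance = lines
    .map((line, index) => ({ line, index }))
    .sort((a, b) => a.line.importance - b.line.importance || a.index - b.index);

  const dropped = new Set<number>();
  let freed = 0;
  for (const entry of byImportance) {
    if (freed >= charsToFree) break;
    dropped.add(entry.index);
    freed += entry.line.text.length + 1; // assumption: +1 for the newline
  }
  return lines.filter((_, index) => !dropped.has(index)).map((l) => l.text);
}
```

With two zero-importance debug prints around a high-importance `epsilon_0 = 8.854e-12` line, any budget the prints can satisfy leaves the constant in place, which is the behavior the entry describes.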

Project Knowledge Index — Symbol-Level Vectors + Graph Fusion in an On-Disk Vector DB — upgrades the shipped EmbeddingIndex (which today stores one 384-dim all-MiniLM-L6-v2 vector per file in a flat Float32Array at .sidecar/cache/embeddings.bin with a JSON metadata sidecar and a linear cosine scan at query time) into a Pro-grade codebase intelligence layer that stays entirely on disk, answers global questions, and models relationships — not just text matches. The gap this closes is best illustrated by the canonical repo-awareness question “where is the auth logic handled?”: the current flat index returns files whose text happens to mention “auth” somewhere, which usually means the middleware file is found but the routes that use it without saying “auth” are missed, and on a 10k-file repo the linear scan is slow enough to be noticeable. Copilot Pro answers this well because it indexes at symbol granularity and understands the call graph; this entry brings the same capability on-disk and local-first. Three layered changes: (1) Proper on-disk vector store via embedded LanceDB — a Rust-native columnar vector DB with a Node binding, HNSW indexes for sub-ms ANN over millions of vectors, metadata filtering (query “auth” only inside src/middleware/**), atomic writes, and zero external processes to manage. LanceDB is chosen over ChromaDB because Chroma’s Node support goes through a Python subprocess, which is a deployment footgun in a VS Code extension; LanceDB ships as a single .node binary with no runtime dependencies. Storage lives at .sidecar/cache/lance/ (already covered by the gitignored-subdirs carve-out). (2) Symbol-level chunking replaces one-vector-per-file — every function, class, method, interface, and significant top-level comment block becomes its own indexed chunk. 
The existing symbolGraph.ts already runs tree-sitter over the workspace and knows symbol boundaries, so it becomes the chunker: each SymbolNode produces one vector from its body text plus docstring, tagged with { filePath, range, kind, name, containerSymbol, hash }. Granularity goes from thousands of file-vectors to hundreds of thousands of symbol-vectors; retrieval returns the specific function, not the whole file. (3) Graph-walk retrieval closes the “middleware vs routes” gap — after the initial vector hit, the retriever walks the symbol graph’s typed edges (defines, calls, imports, used-by) up to sidecar.projectKnowledge.graphWalkDepth (default 2) and surfaces symbols reachable from the hit even when their text doesn’t match the query. So “where is auth handled?” retrieves requireAuth middleware via vector similarity, then walks the used-by edges to return every route handler that wraps it — without those routes needing to say the word “auth.” The walk is budgeted (breadth-first up to maxGraphHits, default 10) so a popular symbol like logger.info can’t drown the result list. Incremental updates: VS Code’s onDidChangeTextDocument / onDidCreateFiles / onDidDeleteFiles / onDidRenameFiles events drive re-embedding of only the changed symbols (not the whole file), resolved by content-hashing each symbol’s body — unchanged symbols keep their cached vectors so a one-line edit in a 2000-line file costs one re-embed, not 200. Rename events move the vector metadata instead of re-embedding. A background queue with 500ms debounce + 30s persist-to-disk matches the existing pattern at embeddingIndex.ts:24-25. New agent tool: project_knowledge_search(query, { maxHits?, graphWalkDepth?, kindFilter?, pathGlob? 
}) returns structured { symbol, filePath, range, score, relationship }[] with relationship tagging whether each hit was a direct vector match or reached via graph walk ("vector: 0.82", "graph: used-by → 2 hops from requireAuth"), so the model sees why each result surfaced and can weight accordingly. Migration from the flat index is transparent: on first activation with the new backend, the existing .sidecar/cache/embeddings.bin is read, re-chunked to symbol-level, and ingested into LanceDB; the old file is kept for one version as a rollback safety net, then deleted. UI: a Project Knowledge sidebar panel shows index health (symbols indexed, last update time, vector count, disk footprint), a rebuild-from-scratch button for pathological cache states, and a search box that exposes the same project_knowledge_search tool for the user to query interactively. Composes with every earlier retrieval entry: SemanticRetriever in the fusion pipeline now queries symbols rather than files (hits are smaller and more precise, so RRF competes them more fairly against doc and memory hits); Semantic Time Travel uses per-commit LanceDB snapshots at .sidecar/cache/lance/history/<sha>/; Memory Guardrails pins entries go in the same store with a pinned: true metadata flag and a filter that always includes them regardless of score; the Semantic Agentic Search for Monorepos entry becomes “N LanceDB tables queried in parallel” — same code path, different roots. 
Configured via sidecar.projectKnowledge.enabled (default true), sidecar.projectKnowledge.backend (lance | flat, default lance; flat preserves the current behavior for users on constrained platforms where the native binding won’t load), sidecar.projectKnowledge.chunking (symbol | file, default symbol), sidecar.projectKnowledge.graphWalkDepth (default 2), sidecar.projectKnowledge.maxGraphHits (default 10), sidecar.projectKnowledge.indexPath (default .sidecar/cache/lance/), sidecar.projectKnowledge.maxSymbolsPerFile (default 500 — guard against generated files with 50k symbols), and sidecar.projectKnowledge.embedOnSave (default true; set false for manual rebuild only).

flowchart TD
    Q[Query: 'where is auth handled?'] --> E[Embed query<br/>all-MiniLM-L6-v2]
    E --> ANN[LanceDB HNSW search<br/>sub-ms ANN over<br/>symbol vectors]
    ANN --> V[Vector hits<br/>e.g. requireAuth middleware]
    V --> GW{Graph walk<br/>depth ≤ 2}
    GW -->|used-by edges| R1[Route handlers<br/>wrapping requireAuth]
    GW -->|calls edges| R2[Called helpers<br/>verifyToken, etc.]
    GW -->|imports edges| R3[Modules importing<br/>the middleware]
    V & R1 & R2 & R3 --> RANK[Rank + tag by<br/>relationship path]
    RANK --> OUT[Structured hits:<br/>symbol, filePath, range,<br/>score, relationship]

    subgraph Updates
        W[onDidChangeTextDocument] --> H[Hash changed symbols]
        H --> D{Diff vs cached}
        D -->|changed| RE[Re-embed only<br/>changed symbols]
        D -->|unchanged| SKIP[Keep cached vector]
        RE --> UP[Atomic upsert<br/>to LanceDB]
    end
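The budgeted graph walk in step (3) can be sketched as a breadth-first traversal from a vector hit. This is an illustrative sketch only: the `SymbolGraph` adjacency-map shape and the `graphWalk` name are assumptions, not the structures in symbolGraph.ts. The edge kinds, the depth default of 2, the `maxGraphHits` default of 10, and the relationship-string format are taken from the entry.

```typescript
type EdgeKind = "defines" | "calls" | "imports" | "used-by";

interface Edge {
  to: string;
  kind: EdgeKind;
}

// Hypothetical adjacency map: symbol name to its outgoing typed edges.
type SymbolGraph = Map<string, Edge[]>;

interface GraphHit {
  symbol: string;
  relationship: string;
}

// Breadth-first walk from a vector hit, bounded by depth and by a hit budget
// so a heavily referenced symbol cannot flood the result list.
function graphWalk(
  graph: SymbolGraph,
  seed: string,
  graphWalkDepth: number = 2,
  maxGraphHits: number = 10,
): GraphHit[] {
  const hits: GraphHit[] = [];
  const seen = new Set<string>([seed]);
  let frontier: string[] = [seed];

  for (let depth = 1; depth <= graphWalkDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const from of frontier) {
      for (const edge of graph.get(from) ?? []) {
        if (seen.has(edge.to)) continue;
        seen.add(edge.to);
        // The tag records the kind of the last edge taken and the hop count.
        const hops = depth === 1 ? "1 hop" : depth + " hops";
        hits.push({
          symbol: edge.to,
          relationship: "graph: " + edge.kind + " → " + hops + " from " + seed,
        });
        if (hits.length >= maxGraphHits) return hits;
        next.push(edge.to);
      }
    }
    frontier = next;
  }
  return hits;
}
```

Breadth-first order means the budget is spent on the nearest symbols first, so a two-hop helper never displaces a route handler that wraps the middleware directly.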

Merkle-Addressed Semantic Fingerprint — Keystroke-Live Structural Index — layers a content-addressed Merkle tree over the Project Knowledge Index so change detection, integrity verification, and sync across sessions/machines become O(log n) instead of O(n), and re-embedding on a per-file save compresses to re-hashing on a per-keystroke basis with no latency cost. Current state honestly: EmbeddingIndex runs a 500ms debounced incremental update on onDidChangeTextDocument, re-embeds the whole file each time, persists as a flat binary every 30s. Works, but two things fall out of this: (a) large monorepos pay an index-walk cost for every query because there’s no hierarchy to prune with, and (b) “what changed since you were last here?” requires re-hashing everything because nothing is addressed structurally. This entry adds a Merkle layer that makes both of those sub-linear. The structure: every symbol-level chunk (already the granularity proposed in Project Knowledge Index) becomes a Merkle leaf with a content hash blake3(body ‖ path ‖ kind ‖ range) and its embedding as leaf metadata. Interior nodes aggregate their children’s hashes (blake3(child1 ‖ child2 ‖ …)) and also carry a mean-pooled aggregated embedding of their subtree, so the retriever can score whole subtrees at the interior level and skip them entirely without touching the leaves. The root hash is the repository’s semantic fingerprint — a single 32-byte string that changes iff any symbol in the workspace changed. Keystroke-live updates: VS Code’s onDidChangeTextDocument fires on every edit with the modified ranges; the Merkle layer intercepts this and does the cheap work (re-hashing the containing symbol’s leaf, then the O(log n) parent chain up to the root) on every keystroke with no debounce — blake3 is fast enough that a 100-file-deep hash walk finishes in well under a millisecond. 
The expensive work (re-embedding) stays on a 300ms debounce because embedding is what actually takes ~20-50ms per chunk on-device — so the Merkle state is always current, the embedding state is eventually consistent within ~300ms, and the retriever can distinguish “this subtree is stale” (hash changed but embed hasn’t caught up — score with last-known embed, flag as stale: true for honest UX) from “this subtree is fresh.” Where the latency comes from on a large monorepo — at query time the retriever walks down the tree: compute query embedding, compare against each of the root’s direct children’s aggregated embeddings, descend into the top-k subtrees, recurse. A workspace with 500k symbols becomes ~20 interior-level comparisons to narrow down to the top ~2k leaves, then an HNSW ANN search over those 2k (sub-ms in LanceDB). Total end-to-end latency: ~10–30ms on typical hardware even against a million-symbol index, which is the regime where “find that function three folders away” starts to feel instant rather than noticeable. Cache validity and sync become trivial byproducts of the root hash: on startup, SideCar recomputes the root over the current disk state (fast — just content hashes, no embeddings) and compares to the cached root; if they match, the whole index is reused as-is (no rebuild); if they differ, a tree walk finds exactly the changed subtrees and only those are re-embedded. The same mechanism gives cross-machine parity at trivial cost — Multi-User Agent Shadows’ shadow.json can include the Merkle root, so a teammate’s instance verifies index alignment in one 32-byte comparison and requests only the diff subtrees if misaligned. For Semantic Time Travel per-commit snapshots, unchanged subtrees dedup automatically (two commits that differ only in src/utils/foo.ts share every other subtree hash and therefore every other subtree’s cached embeddings) — a git-like compression ratio on the snapshot store without any custom encoding work. 
Lineage queries (/diff-since <commit-or-timestamp>) become a Merkle diff: two roots, descend into subtrees whose hashes differ, return the symbol-level changes — answerable in O(differences) rather than O(repo size), which is what makes “what changed since I was last here?” feel instant in sessions that span weeks. ~200-272k token context-window utilization: a frontier-model context window of this size is big enough to fit a small project outright, but for a 500k-symbol monorepo even 272k tokens is maybe 2% of the repo by token count, so the retriever’s job is to pick the 2% that matters. Merkle-addressed aggregated embeddings at interior nodes let the retriever select the most relevant subtrees first and materialize exactly as many as the context budget allows, with provably correct “you got the top-k subtrees for your budget” semantics rather than the current best-effort flat scan. Near-zero latency doesn’t come from precomputation alone — it comes from not having to walk most of the tree per query. Storage layout (.sidecar/cache/merkle/, covered by the gitignored-subdirs carve-out): tree.bin for the structure (parent/child pointers + hashes, mmapped), embeddings.lance/ for leaf and interior-node vectors (the same LanceDB store from Project Knowledge Index, now with an extra level: 0|1|2|… metadata column for interior-node rows), roots.log for an append-only history of root hashes with timestamps so time-travel queries work without keeping full per-commit trees. Live root hash persists to roots.log every sidecar.merkleIndex.rootSnapshotEveryMs (default 10000, 10s) so a crash loses at most that interval of lineage data — the Merkle state itself is rebuildable from disk in ~seconds for any repo size. 
Integration with every earlier entry: Project Knowledge Index becomes the similarity layer and Merkle becomes the addressing layer (they compose — Merkle narrows candidate subtrees, LanceDB HNSW ranks within them); Semantic Time Travel stores per-commit roots instead of per-commit full indexes (dedup-heavy; a 500-commit history costs ~the same as 10 if the churn is low); Multi-User Agent Shadows syncs Merkle roots in shadow.json for team index parity; Fork & Parallel Solve shows root-diff between forks as a structural summary of “what did each fork actually change” alongside the file diff; Model Routing can gate on change velocity (symbols under a high-churn subtree escalate to a more thorough model); Regression Guards can be targeted by subtree (a physics guard only fires when the touched symbols’ Merkle path contains src/physics/**). Configured via sidecar.merkleIndex.enabled (default true when Project Knowledge is enabled — they’re architecturally coupled), sidecar.merkleIndex.hashAlgorithm (blake3 default for speed, sha256 fallback for environments without a blake3 binding), sidecar.merkleIndex.liveUpdates (default true — hash on keystroke; set false to match the current 500ms-debounce-on-save behavior), sidecar.merkleIndex.rootSnapshotEveryMs (default 10000), sidecar.merkleIndex.aggregationStrategy (mean-pool | max-pool | attention-pool, default mean-pool — attention-pool is future work needing a trained head; mean-pool is the boring-and-correct default), and sidecar.merkleIndex.maxSymbolsForLiveHash (default 50000 — above this, fall back to debounced updates even in live mode because keystroke-rate hashing of a 500k-leaf tree becomes non-trivial even at blake3 speeds).

flowchart TD
    subgraph Tree ["Merkle tree of symbols"]
        R["Root hash<br/>blake3 + mean-pooled<br/>aggregated embedding"]
        R --> S1["Subtree src/<br/>hash + agg-embed"]
        R --> S2["Subtree tests/<br/>hash + agg-embed"]
        S1 --> F1["File hash<br/>agg of symbols"]
        S1 --> F2["File hash<br/>agg of symbols"]
        F1 --> L1["Leaf: function authN<br/>content hash +<br/>384-dim embedding"]
        F1 --> L2["Leaf: class AuthMiddleware<br/>..."]
    end
    K[User keystroke] --> UL[Re-hash modified leaf]
    UL --> UP["Walk up O(log n)<br/>update ancestor hashes"]
    UP --> R
    UL -.debounced 300ms.-> EM[Re-embed leaf]
    EM --> AGG[Recompute aggregated<br/>embeddings on ancestor path]

    subgraph Query ["Query path"]
        Q[Query embedding] --> QR[Compare vs root's<br/>direct children]
        QR --> DESC[Descend top-k subtrees]
        DESC --> HNSW[HNSW ANN<br/>over narrowed leaves]
        HNSW --> HITS[Ranked symbol hits]
    end
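The keystroke path (re-hash one leaf, then its ancestor chain) and the lineage diff (skip subtrees whose hashes match) can be sketched as below. This is a minimal illustration under assumptions: it uses sha256 from Node's built-in crypto module, which the entry lists as the fallback algorithm, because blake3 would need a third-party binding. The `MerkleNode` shape, the function names, and the NUL separator between hashed fields are hypothetical, and aggregated embeddings are omitted.

```typescript
import { createHash } from "node:crypto";

// sha256 stands in for blake3 here. Fields are joined with a NUL separator
// (an assumption; the spec only writes "body ‖ path ‖ kind ‖ range").
function hashParts(parts: string[]): string {
  return createHash("sha256").update(parts.join("\u0000")).digest("hex");
}

interface MerkleNode {
  label: string;
  hash: string;
  parent?: MerkleNode;
  children: MerkleNode[];
}

function makeLeaf(label: string, body: string, path: string, kind: string, range: string): MerkleNode {
  return { label, hash: hashParts([body, path, kind, range]), children: [] };
}

// An interior node hashes the concatenation of its children's hashes.
function makeInterior(label: string, children: MerkleNode[]): MerkleNode {
  const node: MerkleNode = { label, hash: hashParts(children.map((c) => c.hash)), children };
  for (const child of children) child.parent = node;
  return node;
}

// Re-hash one leaf, then only its ancestor chain up to the root.
// Returns the number of interior nodes that were recomputed.
function updateLeaf(leaf: MerkleNode, body: string, path: string, kind: string, range: string): number {
  leaf.hash = hashParts([body, path, kind, range]);
  let rehashed = 0;
  for (let node = leaf.parent; node !== undefined; node = node.parent) {
    node.hash = hashParts(node.children.map((c) => c.hash));
    rehashed++;
  }
  return rehashed;
}

// Labels of leaves that differ between two trees. Subtrees with equal
// hashes are never descended into, so cost tracks the size of the change.
function merkleDiff(a: MerkleNode | undefined, b: MerkleNode | undefined): string[] {
  const node = b ?? a;
  if (node === undefined) return [];
  if (a !== undefined && b !== undefined && a.hash === b.hash) return [];
  const aKids = a ? a.children : [];
  const bKids = b ? b.children : [];
  if (aKids.length === 0 && bKids.length === 0) return [node.label];

  const labels: string[] = [];
  for (const child of aKids.concat(bKids)) {
    if (labels.indexOf(child.label) === -1) labels.push(child.label);
  }
  const changed: string[] = [];
  for (const label of labels) {
    const sub = merkleDiff(
      aKids.find((c) => c.label === label),
      bKids.find((c) => c.label === label),
    );
    changed.push(...sub);
  }
  return changed;
}
```

Because an edit only touches the leaf and its ancestors, sibling subtrees keep their hashes, which is what lets both the startup check and the diff skip them without inspection.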

Editing & Code Quality

Inline edit enhancement — ghost text overlay for single-cursor edits shipped in v1.0 (src/edits/inlineEditProvider.ts). Remaining: extend to write_file multi-file edits, per-hunk syntax-highlighted diff, and batched edit streams. Deferred post-v1.0.
Selective regeneration — “pin and regen” UI: lock good sections, regenerate only unlocked portions

Multi-File Edit Streams: DAG-Dispatched Parallel Writes. This entry closes the Copilot-free-vs-Pro gap on wide refactors by letting the agent stream changes across N files at once instead of serializing them one at a time. The current loop already batches multiple tool_use blocks within a single assistant turn (the model can emit write_file src/a.ts + write_file src/b.ts in one message and executeToolUses dispatches them together), but two gaps stop this from feeling like Pro-grade multi-file editing: (1) the agent rarely plans a multi-file edit up front; it tends to edit one file, wait to see the result, then decide the next edit, which serializes execution even when the edits are logically independent; and (2) the UI streams one diff preview at a time rather than N in parallel, so even batched writes feel sequential to the user. This entry addresses both. Up-front edit planning: when a task is large enough (sidecar.multiFileEdits.minFilesForPlan, default 3), the loop inserts a mandatory Edit Plan pass before any write_file fires. The planner agent produces a typed manifest, EditPlan { edits: { path, op: 'create' | 'edit' | 'delete', rationale, dependsOn: path[] }[] }, and the runtime builds a DAG from the dependsOn edges. Independent nodes run in parallel up to sidecar.multiFileEdits.maxParallel (default 8); edits with dependencies wait for their prerequisites (rename a symbol’s definition before editing the call sites). The plan surfaces in the chat UI as a collapsible Planned edits card the user can inspect, and amend via Steer Queue nudges like “skip src/legacy/, I’ll do those manually”, before execution starts, so the scope is transparent up front instead of discovered one file at a time. Parallel streaming diff previews: the webview’s existing streamingDiffPreviewFn path is extended to handle N concurrent streams. 
Each in-flight file gets its own Pending Changes panel tile with a live diff, chars-streamed progress bar, and per-file abort button; on an 8-wide edit the user sees all eight files populate simultaneously rather than watching them tick through one by one. Conflict detection happens at plan time, not write time: when two edit ops target the same file, the DAG builder merges them into one op with a combined rationale, and when a plan contains circular dependencies, the planner is asked to revise once and the cycle is surfaced as an error if it persists. Atomic review semantics: by default, the Pending Changes panel treats a multi-file plan as a single unit of work. Accepting one file without the others can leave the codebase in a broken intermediate state (renamed definition + unrenamed call sites), so the default is accept-all or reject-all. Two escapes: sidecar.multiFileEdits.reviewGranularity set to per-file exposes individual file checkboxes for advanced users who want surgical control, and per-hunk drops down to hunk-level even across files. Integration with every earlier feature: all N streams land in the Shadow Workspace, so the main tree sees only the final bulk merge regardless of how many files are in flight; Regression Guards fire once against the full edit set rather than per-file, which is often what the user actually wants (a guard that only makes sense after the whole rename is done shouldn’t fail N−1 times during intermediate states); Audit Mode’s treeview shows N parallel buffered writes with per-file checkboxes matching the same granularity setting; Fork & Parallel Solve lets each fork contain its own multi-file plan for side-by-side comparison of wide-refactor strategies; Skills 2.0 can cap multi-file fanout per skill (a narrow test_author skill might set max-parallel-edits: 1 in its tool-budget). 
Planning-pass cost: the planning pass adds one extra LLM turn before edits start, so the feature is opt-out-able when the user knows better (an @no-plan sentinel in the prompt skips the planner), and the planner can use a small local model via sidecar.multiFileEdits.plannerModel (default falls back to the main model) since planning is structured-output-heavy and doesn’t need the full reasoning budget of the editing model. Configured via sidecar.multiFileEdits.enabled (default true), sidecar.multiFileEdits.maxParallel (default 8), sidecar.multiFileEdits.planningPass (default true), sidecar.multiFileEdits.minFilesForPlan (default 3; the planner is skipped for smaller edits), sidecar.multiFileEdits.plannerModel (default empty, which reuses the main model), and sidecar.multiFileEdits.reviewGranularity (bulk | per-file | per-hunk, default bulk, matching the accept-all-or-reject-all default described above).

flowchart TD
    U["User task<br/>spans 3+ files"] --> PL[Edit Plan pass<br/>planner model]
    PL --> PLAN[EditPlan manifest<br/>edits + dependsOn DAG]
    PLAN --> CARD[Planned edits card<br/>in chat UI]
    CARD -->|User nudge via Steer Queue| PL
    CARD -->|OK to proceed| DAG[Topological schedule]
    DAG --> PAR[Dispatch independent nodes<br/>up to maxParallel]
    PAR --> S1[write_file src/a.ts]
    PAR --> S2[write_file src/b.ts]
    PAR --> SN[...up to 8 streams]
    S1 & S2 & SN --> PC[Pending Changes panel<br/>N parallel diff previews]
    PC --> DEP[Dependent nodes fire<br/>after prereqs land]
    DEP --> PC
    PC --> GATE{Gate + Guards<br/>against full edit set}
    GATE -->|green| REV[Review: bulk /<br/>per-file / per-hunk]
    GATE -->|red| FB[Feedback to agent<br/>+ refine plan]
    REV --> M[Atomic merge to shadow]
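
The topological schedule step can be sketched as a function that groups planned edits into waves. This is a simplified illustration, not the runtime's scheduler: `PlannedEdit` mirrors the EditPlan fields from the text, while `scheduleWaves` is a hypothetical helper that assumes duplicate-path ops have already been merged and dispatches in whole waves (a real dispatcher would more likely start each node as soon as its prerequisites land).

```typescript
// Field names follow the EditPlan manifest in the text; the function is hypothetical.
type EditOp = "create" | "edit" | "delete";

interface PlannedEdit {
  path: string;
  op: EditOp;
  rationale: string;
  dependsOn: string[];
}

// Groups edits into waves. Every edit in a wave has all of its prerequisites in
// earlier waves, and no wave is wider than maxParallel.
// Throws on unknown dependencies and on circular dependencies.
function scheduleWaves(edits: PlannedEdit[], maxParallel: number): string[][] {
  if (maxParallel < 1) throw new Error("maxParallel must be at least 1");

  const known = new Set<string>();
  for (const e of edits) known.add(e.path);
  for (const e of edits) {
    for (const dep of e.dependsOn) {
      if (!known.has(dep)) throw new Error(`unknown dependency: ${dep}`);
    }
  }

  const done = new Set<string>();
  let pending = edits.slice();
  const waves: string[][] = [];

  while (pending.length > 0) {
    // An edit is ready once every path it depends on has been written.
    const ready = pending.filter((e) => e.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("circular dependency in edit plan");

    const wave = ready.slice(0, maxParallel).map((e) => e.path);
    for (const p of wave) done.add(p);
    pending = pending.filter((e) => !done.has(e.path));
    waves.push(wave);
  }
  return waves;
}
```

For a rename, the definition file lands in the first wave and its call sites in the second, which is the ordering the dependsOn edges are meant to guarantee.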

Zero-Latency Local Autocomplete via Speculative Decoding: pairs a tiny “draft” model (roughly 1.5B params or fewer, e.g. qwen2.5-coder:0.5b, deepseek-coder:1.3b-distill, or the new generation of sub-B code drafts) with the user’s main FIM model (typically 7B–30B) and runs speculative decoding on the two in lockstep, amortizing the cost of the big model’s forward pass across k draft tokens per step. The existing completeFIM path at client.ts:286 and InlineCompletionProvider at completions/provider.ts:79 stream the result straight into VS Code’s ghost-text surface; today this runs the big model alone and inherits its raw tok/s. With a well-matched draft pair on decent local hardware (RTX 4090 / M3 Max / 128GB+ unified memory), empirically observed speedups are 2–4× on code continuations where the draft’s guesses agree with the target most of the time, pushing a 30B coder from ~30 tok/s to ~80–120 tok/s, which crosses the perception threshold from “noticeably waiting” to “appearing as you type.” Target UX: autocomplete that feels like Copilot / Cursor Pro without the round-trip to a cloud provider and without ongoing token spend. Mechanism: the draft generates k candidate tokens serially (cheap, because the small model’s per-token cost is a small fraction of the target’s), the target verifies all k in a single parallel forward pass (one big-model step cost covers k tokens of throughput), and the runtime accepts the longest prefix where the target’s argmax matches the draft’s proposal, uses the target’s token at the first disagreement, and discards the rest of the draft. A rejection-sampled variant is supported for temperature>0, but the default is greedy since autocomplete wants determinism. 
Backend integration: Ollama and Kickstand both back onto llama.cpp, which has native speculative decoding support (--model-draft, --draft parameters); the path is to surface this through the backend abstraction as a new optional draftModel field on SideCarConfig, have OllamaBackend.completeFIM pass draft_model to /api/generate when set, and have KickstandBackend.completeFIM pass the equivalent to its OAI-compat endpoint. For backends that don’t expose speculative decoding (Anthropic, OpenAI, remote OpenAI-compatible that haven’t enabled it), the setting is a silent no-op and completion runs target-only, with no breakage and no warnings. Model pairing: a curated DRAFT_MODEL_MAP ships with sensible defaults (qwen3-coder:30b → qwen2.5-coder:0.5b, deepseek-coder:33b → deepseek-coder:1.3b-base, codellama:34b → codellama:7b-code) so users who just select a big model from the picker get the speedup automatically if the draft is installed, with a one-click “install recommended draft” affordance if not. Tokenizer compatibility is a hard requirement (same family, same vocab): the map only pairs models known to share tokenizers, and manual overrides that violate this are rejected with a specific error rather than producing garbled output. VRAM guardrails: running two models costs memory, so this integrates with the GPU-Aware Load Balancing roadmap entry; if VRAM headroom drops below the threshold while a big training job is going, speculative mode auto-disables and falls back to target-only rather than crashing. FIM prompt format carries through unchanged: the existing <|fim_prefix|> / <|fim_suffix|> / <|fim_middle|> delimiters are respected by both models in a matched pair. 
Configured via sidecar.speculativeDecoding.enabled (default true when a draft mapping exists for the active model, false otherwise, so the common case is zero-config), sidecar.completionDraftModel (explicit override, falls back to the curated map), sidecar.speculativeDecoding.lookahead (default 5; the number of draft tokens per verification step, where higher = more speedup when the draft is accurate and lower = less wasted compute when the draft is wrong), sidecar.speculativeDecoding.temperature (default 0, i.e. greedy; raise for rejection-sampled generation if completions feel stale), and sidecar.speculativeDecoding.minAcceptRateToKeepEnabled (default 0.4; if the observed accept rate drops below this after a warmup window, speculation is disabled automatically because the draft isn’t earning its keep and is just burning compute).
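
The minAcceptRateToKeepEnabled behavior could be tracked with a small counter like the one below. This is a sketch under assumptions: `AcceptRateMonitor` and its method names are hypothetical, and the warmup length is an invented parameter, since the text only says “after a warmup window”. Only the 0.4 default comes from the spec.

```typescript
// Hypothetical tracker; warmupSteps is an assumed knob, not a documented setting.
class AcceptRateMonitor {
  private proposed = 0;
  private accepted = 0;
  private steps = 0;
  private readonly minAcceptRate: number;
  private readonly warmupSteps: number;

  constructor(minAcceptRate: number = 0.4, warmupSteps: number = 20) {
    this.minAcceptRate = minAcceptRate;
    this.warmupSteps = warmupSteps;
  }

  // Call once per speculative step with the draft size and the accepted prefix length.
  record(proposedTokens: number, acceptedTokens: number): void {
    this.proposed += proposedTokens;
    this.accepted += acceptedTokens;
    this.steps += 1;
  }

  acceptRate(): number {
    return this.proposed === 0 ? 1 : this.accepted / this.proposed;
  }

  // During warmup, speculation always stays on; afterwards it must earn its keep.
  shouldKeepSpeculating(): boolean {
    if (this.steps < this.warmupSteps) return true;
    return this.acceptRate() >= this.minAcceptRate;
  }
}
```

A cumulative rate is the simplest choice; a sliding window would react faster when the user switches to a file type the draft handles poorly.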
```
sequenceDiagram
 participant E as Editor (ghost text)
 participant P as InlineCompletionProvider
 participant D as Draft model (0.5B)
 participant T as Target model (30B)

 E->>P: completion trigger (debounced)
 P->>P: build FIM prompt (prefix + suffix)
 loop Speculative step
 P->>D: generate k tokens (fast, serial)
 D-->>P: [t1, t2, ..., tk]
 P->>T: verify [t1..tk] in one parallel forward pass
 T-->>P: logits for each position
 P->>P: accept longest matching prefix, replace first mismatch
 end
 P-->>E: stream accepted tokens as ghost text
 Note over P: Typical accept rate 60-80% → 2-4× throughput vs target alone
```
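
The greedy accept rule from the “accept longest matching prefix, replace first mismatch” step can be written as a pure function over token ids. This is an illustrative sketch: `acceptGreedy` is a hypothetical name, the target's argmax per position is assumed to be already extracted from its logits, and the extra “bonus” token that the target can contribute when every draft token is accepted is left out for brevity.

```typescript
interface AcceptResult {
  tokens: number[]; // tokens to emit this step
  accepted: number; // how many draft tokens were accepted as-is
}

// draft: the k token ids proposed by the draft model.
// targetArgmax: the target model's greedy choice at each of those k positions.
function acceptGreedy(draft: number[], targetArgmax: number[]): AcceptResult {
  if (targetArgmax.length < draft.length) {
    throw new Error("target must score every draft position");
  }
  const tokens: number[] = [];
  for (let i = 0; i < draft.length; i++) {
    if (draft[i] === targetArgmax[i]) {
      tokens.push(draft[i]);
    } else {
      // First disagreement: take the target's token and discard the rest of the draft.
      tokens.push(targetArgmax[i]);
      return { tokens, accepted: i };
    }
  }
  return { tokens, accepted: draft.length };
}
```

Because the emitted tokens always equal what the target would have chosen greedily, the output matches target-only decoding; the draft only changes how many tokens each big-model pass yields.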

Agent Capabilities

Chat threads and branching — parallel branches, named threads, thread picker, per-thread persistence
Persistent executive function — multi-day task state in .sidecar/plans/ tracking progress, decisions, and blockers across sessions

First-Class Skills 2.0 — Typed Personas with Tool Allowlists, Preferred Models, and Composition — upgrades the shipped SkillLoader from “inject markdown into the prompt” into a full persona system where each .agent.md (or existing .md) skill is a declarative contract the runtime actually enforces. The parser at skillLoader.ts:54 already reads — but silently ignores — Claude-Code-compatible frontmatter fields (allowed-tools, disable-model-invocation); this entry makes every one of those fields load-bearing and adds several more. Enforced frontmatter schema:

---
name: Git Expert
description: Focused git workflow assistance
scope: session # turn | task | session — how long the skill stays active
allowed-tools: [git_status, git_diff, git_log, git_commit, git_branch, git_push, read_file]
preferred-model: claude-sonnet-4-6 # switch to this model while active; restore on exit
system-prompt-override: false # false = append to base prompt, true = replace it entirely
disable-model-invocation: false # when true, only the user can invoke — model can't auto-select
extends: base-coder # inherit frontmatter + prompt from another skill
variables: # user-supplied args at invocation
  branch: { description: Target branch, required: false }
  message: { description: Commit message, required: false }
auto-context: # auto-inject these tool calls' output as starting context
  - git_status
  - git log -n 10
guards: [branch-protection] # Regression Guards that activate with this skill
tool-budget: # per-tool call caps while this skill is active
  git_commit: 3
---

Each field maps to a concrete runtime behavior: allowed-tools intersects with the current toolPermissions map (most restrictive wins) so /git_expert literally cannot call write_file or run_shell_command regardless of the ambient mode — principle of least privilege per skill, turning a db-writer skill into a real capability boundary and not just an advisory one. preferred-model triggers a scoped updateModel() swap for the skill’s duration; on exit the previous model restores (exceptions revert too, no sticky-state bugs). system-prompt-override: true fully replaces the base prompt with the skill’s content for the hardest personality lock — useful when you want latex_writer to be a LaTeX-only assistant with no inherited general-coder instincts; default false keeps the existing append-as-context behavior for backward compatibility. disable-model-invocation prevents injection-style skill abuse where a hostile file could prompt the model into silently activating a privileged skill — the skill is user-invocation-only. extends gives single-inheritance composition: frontend.agent.md extends base-coder and inherits its tool allowlist + prompt preamble, overriding or extending per-field. variables are resolved at invocation (/git_expert branch=feature/foo) and substituted into the prompt as ${branch} — Claude Code’s $ARGUMENTS convention is also accepted as an alias. auto-context runs a fixed set of read-only tool calls before the skill’s first turn so the model sees pre-fetched state (the git_expert skill always starts with current git status + last 10 commits in its context, no wasteful first-turn git_status call). guards registers per-skill Regression Guards that activate only while the skill is in effect. tool-budget caps per-tool calls (prevents a runaway skill from calling git_commit 50 times). Skill stacking: users can invoke multiple skills simultaneously via /with git_expert /with technical-writer <task> or a persistent stack via the UI picker. 
Tool allowlists intersect (git_expert ∩ technical-writer = only tools both permit); preferred-model conflicts resolve by last-invoked-wins with a visible indicator; prompts concatenate in stack order with section headers so the model sees the layered persona clearly. Scopes: turn skills apply for exactly one user turn and revert; task skills persist until the current task’s completion gate passes; session skills persist until explicitly ended with /unload <skill> or a new session starts — matches the mental model users already have from similar systems. Skills Picker UI: a new sidebar panel replaces “type the slash command and hope you remember the name” with a searchable grid of available skills — tagged by category (git / frontend / security / scientific / writing), preview of the persona’s opening instructions, the tool allowlist rendered as chips, and a Stack button to add without replacing. Telemetry (local-only, opt-in): per-skill usage count, average turns-to-completion, accept rate of the skill’s proposed changes — surfaced in the picker so users can see which skills are earning their keep and which are dead weight. Integration with every earlier entry: Facets consume skills via their existing skillBundle field (a facet stacks its declared skills automatically on dispatch); Fork & Parallel Solve can wear different skills per fork (fork A with fourier_approach.agent.md, fork B with wavelet_approach.agent.md); Regression Guards declared in skill frontmatter fire only while the skill is active; Audit Mode can be required by a skill (require-audit: true) for write-heavy skills; Visual Verification criteria can be declared per-skill. 
Backward compatibility: every field is optional. The 8 shipped skills (break-this, create-skill, debug, explain-code, mcp-builder, refactor, review-code, write-tests) keep working unchanged since they declare none of the new fields; missing fields default to the current permissive behavior (full tool access, append-mode prompt, turn-scoped). Configured via sidecar.skills.directories (already exists; extended to accept both .md and .agent.md), sidecar.skills.enforceAllowedTools (default true; false for legacy “advisory only” parsing), sidecar.skills.allowModelInvocation (default true; when false only user-initiated invocation is ever honored, even for skills that don’t declare disable-model-invocation), and sidecar.skills.stackingMode (strict | union | last-wins, default strict; strict intersects tool allowlists, union takes the superset, and last-wins replaces prior skills entirely).
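
How the three stackingMode values could combine tool allowlists is sketched below. The function and type names are hypothetical, and two details are assumptions rather than stated behavior: a skill with no allowed-tools is modeled as `undefined` (fully permissive, per the backward-compatibility rule), and every mode is still capped by the ambient toolPermissions, following the “most restrictive wins” rule.

```typescript
type StackingMode = "strict" | "union" | "last-wins";
type Allowlist = string[] | undefined; // undefined = skill declares no allowed-tools

function intersect(a: string[], b: string[]): string[] {
  return a.filter((tool) => b.indexOf(tool) >= 0);
}

// ambient: tools the current mode permits. stack: allowlists in invocation order.
function effectiveTools(ambient: string[], stack: Allowlist[], mode: StackingMode): string[] {
  // last-wins: only the most recently invoked skill counts.
  const active = mode === "last-wins" ? stack.slice(-1) : stack;

  if (mode === "union") {
    // One permissive skill makes the whole union permissive.
    if (active.length === 0 || active.some((list) => list === undefined)) {
      return ambient.slice();
    }
    const union: string[] = [];
    for (const list of active) {
      for (const tool of list || []) {
        if (union.indexOf(tool) < 0) union.push(tool);
      }
    }
    return intersect(ambient, union);
  }

  // strict (and last-wins over a single skill): intersect every declared allowlist.
  let allowed = ambient.slice();
  for (const list of active) {
    if (list !== undefined) allowed = intersect(allowed, list);
  }
  return allowed;
}
```

Under strict stacking, adding a skill can only shrink the tool surface, which is why it is the safe default.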

flowchart TD
    U[User invokes /git_expert] --> L[SkillLoader resolves +<br/>merges extended skills]
    L --> FM{Frontmatter fields}
    FM --> AT[allowed-tools →<br/>intersect with toolPermissions]
    FM --> PM[preferred-model →<br/>scoped updateModel]
    FM --> SP[system-prompt-override →<br/>replace or append]
    FM --> V[variables → substitute<br/>user args into prompt]
    FM --> AC[auto-context →<br/>pre-fetch read-only tool output]
    FM --> G[guards → register on<br/>HookBus for skill lifetime]
    FM --> TB[tool-budget →<br/>per-skill call caps]
    AT & PM & SP & V & AC & G & TB --> ACT[Skill active]
    ACT --> SCOPE{scope}
    SCOPE -->|turn| T1[Revert after 1 turn]
    SCOPE -->|task| T2[Revert when gate<br/>closes cleanly]
    SCOPE -->|session| T3[Revert on /unload<br/>or session end]
    T1 & T2 & T3 --> REV[Restore prior model,<br/>tool perms, hooks]
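
The “restore prior model” guarantee, including the exception path, comes down to a try/finally around the skill's work. A minimal sketch, assuming a hypothetical `ModelState` holder in place of the real updateModel() plumbing, and kept synchronous for brevity (a real skill turn would await inside the try):

```typescript
// Hypothetical state holder; the actual updateModel() API is not shown in the spec.
interface ModelState {
  model: string;
}

// Runs `work` with the skill's preferred model active, then restores the previous
// model whether `work` returns normally or throws.
function withPreferredModel<T>(
  state: ModelState,
  preferred: string | undefined,
  work: () => T,
): T {
  const previous = state.model;
  if (preferred !== undefined) state.model = preferred;
  try {
    return work();
  } finally {
    state.model = previous;
  }
}
```

The same pattern would extend to tool permissions and hook registrations, which the diagram restores alongside the model.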

Skill Sync & Registry: Git-Native Distribution Across Machines and Projects. This entry extends Skills 2.0 from “manually drop .agent.md files in each project’s .sidecar/skills/ or ~/.claude/commands/” to a proper three-tier distribution model matching Copilot Pro / Cursor’s global agent registry, but git-native and local-first so no SideCar-operated service stands between you and your skills. The three tiers, from smallest blast radius to largest, are already partially supported or genuinely new: (1) Project-level team sync is already solved. Per the Multi-User Agent Shadows .gitignore carve-out, .sidecar/skills/ at the project root stays tracked in git; teams that commit skills there get cross-developer sync for free via the main repo’s history. No new feature is needed at this tier, but this entry documents it as first-class. (2) User-level cross-machine sync is the real gap: ~/.claude/commands/*.md works on one machine, but moving to a second laptop or a new dev container means copying files by hand. SideCar gains sidecar.skills.userRegistry, a git URL (or a local folder) the user owns: on activation, SideCar clones or pulls that repo into ~/.sidecar/user-skills/, the SkillLoader picks up every .agent.md inside as a user-scope skill, and the “Create Skill” flow offers a Publish to your registry checkbox that writes the new skill into the clone + commits + pushes. Standard git auth (SSH keys, GitHub tokens) handles credentials, with no custom auth plumbing. A sidecar.skills.autoPull schedule (on-start | hourly | daily | manual, default on-start) keeps the clone fresh; conflicts surface as notifications pointing to the managed directory for manual merge rather than being silently swallowed. 
(3) Team-scoped additional registries layer on top — sidecar.skills.teamRegistries accepts an array of git URLs, each cloned into a separate subdirectory of ~/.sidecar/team-skills/<registry-slug>/, with the Skills Picker tagging hits by origin registry so a developer on three overlapping teams can see which registry each skill came from and resolve name collisions deterministically (explicit registry prefix: /team-a/db-expert vs /team-b/db-expert). (4) Public marketplace is an optional fourth tier — a lightweight hosted index at registry.sidecar.ai (or any compatible endpoint via sidecar.skills.marketplace) that crawls opted-in public git repos, exposes search / tags / author / install-count metadata, and the Skills Picker’s Browse tab queries it at the user’s request. Installing from the marketplace still does a standard git clone into a managed location — the registry is just an index, not a runtime dependency, so if it goes down your installed skills keep working and future installs fall back to direct git URLs. Skill metadata for distribution extends the Skills 2.0 frontmatter with: version: 1.2.0 (semver, for pinning and update notifications); author: @user (renders in the picker, links to their registry); repository: https://github.com/user/skill-repo (source-of-truth URL for updates); license: MIT (surfaced in the picker so users see the legal posture before invoking); tags: [git, automation] (for marketplace filtering); requires: [@core/base-coder@^1.0] (inter-skill deps resolved transitively at install time). Versioning and pinning: sidecar.skills.versions accepts a map of { "@user/skill-name": "1.2.0" } pins; the Skills Picker shows an Update available badge when a newer version exists upstream but never auto-updates a pinned skill without the user’s explicit OK. 
Trust model is explicit: sidecar.skills.trustedRegistries lists registries that install without prompting; any other registry (including first-use of the public marketplace) prompts with “this skill will be allowed to suggest tool calls and prompt injections to your agent. Review the source at <repository URL>?” on first install, with the skill’s full frontmatter + body shown inline. Skills still respect the allowed-tools and disable-model-invocation guardrails from Skills 2.0, so even an untrusted skill can’t silently escalate beyond its declared tool surface; the trust prompt is about the intent of the skill’s prose, not about bypassing runtime enforcement. Offline is a first-class mode: once a skill is cloned, it works without network, the registry API is optional at runtime, and sidecar.skills.offline (default false) hard-disables every network operation, so the extension becomes a pure local-cache reader, useful in air-gapped environments or in restrictive CI. Integrates with every earlier feature: Facets can reference skills via the same @user/skill-name identifier their skillBundle already uses, and the resolver fetches missing skills on first facet dispatch; Fork & Parallel Solve can pull different skill versions per fork (fork A uses @core/refactor@1.0, fork B uses @core/refactor@2.0, a direct A/B test of a skill upgrade against real code); Project Knowledge Index can embed installed skills into the vector DB so project_knowledge_search "git workflow" finds a relevant skill as a retrieval hit; the Typed Sub-Agent Facets entry’s skillBundle field resolves through this system so a facet’s skill dependencies are fetched deterministically on install. 
Configured via sidecar.skills.userRegistry (git URL or local folder, default empty, i.e. opt-in), sidecar.skills.teamRegistries (array of git URLs, default empty), sidecar.skills.marketplace (URL, default https://registry.sidecar.ai but every install still passes through a trust prompt), sidecar.skills.autoPull (default on-start), sidecar.skills.autoUpdate (manual | weekly | daily, default weekly; respects pins), sidecar.skills.trustedRegistries (array of registry URLs that skip the first-install trust prompt; empty by default), sidecar.skills.versions (pin map), and sidecar.skills.offline (default false; when true, no network calls at all).
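
The interaction between sidecar.skills.versions pins and sidecar.skills.autoUpdate could look roughly like this. It is a sketch with hypothetical names (`compareVersions`, `updateAction`), and the version comparison handles only plain numeric x.y.z strings, not pre-release tags or semver ranges.

```typescript
type AutoUpdate = "manual" | "weekly" | "daily";
type UpdateAction = "up-to-date" | "badge-only" | "auto-update";

// Numeric comparison of x.y.z strings; negative when a < b, positive when a > b.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < 3; i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Pinned skills and manual mode only ever get the "Update available" badge;
// unpinned skills update on the configured schedule.
function updateAction(
  name: string,
  installed: string,
  upstream: string,
  pins: { [name: string]: string },
  autoUpdate: AutoUpdate,
): UpdateAction {
  if (compareVersions(upstream, installed) <= 0) return "up-to-date";
  const pinned = Object.prototype.hasOwnProperty.call(pins, name);
  return pinned || autoUpdate === "manual" ? "badge-only" : "auto-update";
}
```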

flowchart TD
    subgraph Tiers ["Distribution tiers"]
        T1[Project-level<br/>.sidecar/skills/<br/>tracked in repo<br/>ALREADY WORKS]
        T2[User-level<br/>userRegistry<br/>git clone to<br/>~/.sidecar/user-skills/]
        T3["Team-level<br/>teamRegistries[]<br/>per-registry subdirs"]
        T4[Public marketplace<br/>optional index<br/>still git under the hood]
    end
    A[SideCar activation] --> PULL{autoPull schedule}
    PULL --> T2
    PULL --> T3
    UI[Skills Picker<br/>Browse tab] --> MP[marketplace API]
    MP --> T4
    T1 & T2 & T3 & T4 --> SL[SkillLoader<br/>merges with conflict<br/>resolution by prefix]
    SL --> PICK[Unified picker<br/>tagged by origin]
    PICK --> INV[Skill invoked<br/>respects allowed-tools<br/>from Skills 2.0]

    subgraph Trust ["Trust on install"]
        INST[First install<br/>from new registry] --> PROMPT{trustedRegistries<br/>contains it?}
        PROMPT -->|yes| AUTO[Auto-install]
        PROMPT -->|no| MODAL[Show frontmatter +<br/>source link + Install button]
    end
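
The trust-on-install branch, combined with the offline switch, reduces to a small decision function. A sketch only: `decideInstall` and `normalizeRegistry` are hypothetical names, and matching registries by URL with trailing slashes stripped is an assumption about how entries in trustedRegistries would be compared.

```typescript
type InstallDecision = "blocked-offline" | "auto-install" | "prompt";

// Assumed normalization: trim whitespace and drop trailing slashes.
function normalizeRegistry(url: string): string {
  return url.trim().replace(/\/+$/, "");
}

function decideInstall(
  registryUrl: string,
  trustedRegistries: string[],
  offline: boolean,
): InstallDecision {
  // sidecar.skills.offline hard-disables every network operation, installs included.
  if (offline) return "blocked-offline";
  const target = normalizeRegistry(registryUrl);
  const trusted = trustedRegistries.some((r) => normalizeRegistry(r) === target);
  return trusted ? "auto-install" : "prompt";
}
```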

LaTeX agentic debugging — intercepts compiler output (pdflatex / xelatex / lualatex / bibtex / biber) and closes the loop between the raw log and the source tree without the user ever reading a .log file. When a build fails, a dedicated log-parsing agent classifies each error by type (missing brace, undefined reference, BibTeX key mismatch, undefined control sequence, overfull hbox, missing \end, etc.), maps the reported line number back to the actual offending location accounting for \input / \include transclusion, and stages a targeted fix directly in the Pending Changes diff view — ready to accept with one click. Multi-error runs are handled in a single pass: the agent resolves errors in dependency order (e.g. fix the missing } before re-evaluating the downstream undefined-reference cascade) so the build converges in as few iterations as possible. BibTeX / Biber mismatches get special treatment: the agent cross-references the .bib file, the .aux citations, and the bibliography style to distinguish a missing entry from a key typo from a field-format violation, and proposes the minimal .bib edit. Configured via sidecar.latex.enabled (default true when a .tex file is open) and sidecar.latex.buildCommand (defaults to auto-detected latexmk invocation). Surfaces in the chat UI as a LaTeX Build status-bar item that turns red on failure and opens the agent panel on click.
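
The classification step can be approximated with pattern matching over the compiler log. The sketch below is illustrative rather than the log-parsing agent itself: `classifyLatexLog` and the kind labels are hypothetical, it recognizes only a handful of common message shapes from the default (non file-line-error) log format, and it does not handle log lines wrapped at 79 columns or the \input / \include line mapping described above.

```typescript
type LatexIssueKind =
  | "undefined-control-sequence"
  | "missing-brace"
  | "undefined-reference"
  | "undefined-citation"
  | "overfull-hbox"
  | "other-error";

interface LatexIssue {
  kind: LatexIssueKind;
  message: string;
  line: number | null; // line number as reported by TeX, if found
}

function classifyLatexLog(log: string): LatexIssue[] {
  const lines = log.split(/\r?\n/);
  const issues: LatexIssue[] = [];
  for (let i = 0; i < lines.length; i++) {
    const text = lines[i];
    let m: RegExpExecArray | null;
    if (text.indexOf("! ") === 0) {
      let kind: LatexIssueKind = "other-error";
      if (/Undefined control sequence/.test(text)) kind = "undefined-control-sequence";
      else if (/Missing [{}] inserted/.test(text)) kind = "missing-brace";
      // TeX prints the source position as "l.<n> ..." shortly after the "!" line.
      let line: number | null = null;
      for (let j = i + 1; j < lines.length && j <= i + 5; j++) {
        const pos = /^l\.(\d+)/.exec(lines[j]);
        if (pos) {
          line = Number(pos[1]);
          break;
        }
      }
      issues.push({ kind, message: text.slice(2), line });
    } else if (
      (m = /^LaTeX Warning: (Reference|Citation) `[^']+'.* undefined on input line (\d+)/.exec(text))
    ) {
      const kind: LatexIssueKind =
        m[1] === "Reference" ? "undefined-reference" : "undefined-citation";
      issues.push({ kind, message: text, line: Number(m[2]) });
    } else if ((m = /^Overfull \\hbox .* at lines (\d+)--(\d+)/.exec(text))) {
      issues.push({ kind: "overfull-hbox", message: text, line: Number(m[1]) });
    }
  }
  return issues;
}
```

Sorting the result so that brace and control-sequence errors come before reference warnings would give the dependency order the entry describes, since undefined references are often downstream of an earlier hard error.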
Research Assistant — Structured Lab Notebook, Experiment Manifests, and Hypothesis Graph — ties the scattered research-adjacent primitives already across this ROADMAP (Literature synthesis, Doc-to-Test Loop, Integrated LaTeX Preview, LaTeX agentic debugging, Visualization Dashboards, Browser-Agent visual verification) and the shipped domain skills (technical-paper, mathematical-proofs, signal-processing, statistics, radar-fundamentals, electromagnetics) into a cohesive lab-notebook workflow so SideCar stops being “a code assistant that happens to know LaTeX” and becomes “an end-to-end research collaborator that happens to also write code.” The gap today: a user running a simulation, collecting results, iterating on an algorithm, and drafting a paper has to hold all the connective tissue in their head — which experiment tested which hypothesis, which figure came from which data run, which citation supports which claim, which parameter sweep produced which plot. SideCar can help with any individual step but has no persistent model of the project as a research artifact. This entry introduces that model. Research Projects as first-class entities live under .sidecar/research/<project-slug>/ (tracked in git — this is curated state, not ephemeral cache, so it stays out of the gitignored subdirs list) with a clean directory structure: project.yaml (top-level metadata: title, question, hypotheses list, status), experiments/<exp-id>/manifest.yaml (one per experiment with reproducibility fields — see below), literature/ (symlinks or copies into the Literature synthesis index with project-specific notes overlaid), figures/<fig-id>/ (source data + generation script + rendered outputs + captured seed), drafts/ (paper sections, poster, slide decks), and observations/<timestamp>.md (timestamped free-form notes the agent and user both contribute to). Experiment Manifest schema — every experiment is a reproducible, content-addressed unit:
```
id: exp-2026-04-16-fir-comparison
hypothesis: "A wavelet-based decomposition outperforms FFT for detecting sub-cycle transients below -40 dB"
parameters:
 sample_rate_hz: 48000
 snr_db: [-40, -35, -30, -25, -20] # sweep
 filter_order: 256
 seed: 42
environment:
 python: "3.11.7"
 requirements_hash: blake3:abc123...
 git_sha: def456...
 hardware: "M3 Max, 64GB unified"
command: "python experiments/fir_vs_wavelet.py --config exp-config.yaml"
artifacts:
 - results.parquet
 - figures/snr_vs_detection.png
 - logs/run.txt
interpretation: "<agent-written or human-written summary of what the results mean>"
supports: [hypothesis-id] # hypothesis IDs this experiment supports
refutes: [] # hypothesis IDs this experiment refutes
related_work: ["@smith2024", "@jones2023"] # quoted: a YAML plain scalar cannot start with @
status: complete # planning | running | complete | abandoned
```
Running /experiment run <id> dispatches the command inside a Shadow Workspace (so the main tree stays pristine), captures every artifact into experiments/<id>/, and automatically populates environment from git state + pip freeze / npm ls / cargo tree + the current hardware probe (reuses the system_monitor tool from v0.57+). Reproducibility is enforced, not advisory — re-running a stored manifest fails loudly if the git SHA has drifted or the requirements hash doesn’t match, with a “reproduce exactly” path that checks out the recorded SHA into a shadow and re-runs against pinned dependencies. Catches the researcher’s-nightmare scenario of “I can’t reproduce my own result from three weeks ago because numpy silently upgraded.” Hypothesis Graph lives alongside the experiment store: nodes are hypotheses (with their status — open / supported / refuted / needs-more-evidence / abandoned), edges are supports / refutes / depends-on / generalizes derived from the experiments’ supports and refutes fields. Rendered in a sidebar Research Board as a force-directed graph (via the Visualization Dashboards MCP layer once that ships, with a Mermaid fallback in the interim), showing which hypotheses have evidence piling up, which are contested (experiments both support and refute), and which are dangling (stated but never tested). The agent treats this graph as first-class context — “we have three experiments supporting H1 but H2 is untested and contradicts H1 — should we run an experiment isolating them?” becomes a suggestion the agent can make, backed by the actual state of your research. New agent tools layered onto the existing 23+ tool catalog: run_experiment(manifest) dispatches a recorded manifest and captures its artifacts; log_observation(text, relatedTo: {experiment? | hypothesis? 
| figure?}) appends a timestamped observation to observations/ with structured cross-references; test_hypothesis(id) aggregates evidence across linked experiments and returns a verdict with confidence (Bayesian posterior if priors are declared, otherwise a simple experiment-count ratio); find_related_work(topic, depth) walks the Literature graph (via the Literature synthesis index) up to N hops, surfacing papers the project doesn’t yet cite but probably should; suggest_next_experiment(hypothesis) reasons over what would most reduce uncertainty given existing evidence (uses the Thinking Visualization self-debate mode so the user can see the reasoning); validate_statistics(data, test, alpha) runs sample-size / statistical power / effect-size / multiple-comparison checks via a bundled statistics skill-facet and blocks claiming a finding as “supported” until the checks pass; generate_figure(data, spec, caption) produces matplotlib / plotly / tikz output with captured seed + code + parameters, stored as a reproducible figure bundle; draft_section(kind: 'abstract'|'intro'|'methods'|'results'|'discussion'|'related-work', sources) produces a paper section grounded in the actual experiment manifests + literature graph, with every claim traced back to an experiment ID or citation (no unsupported claims survive the generation — composes with the RAG-Native Eval Metrics entry’s faithfulness scorer). Reviewer simulation — before the user shares a paper draft, /review-as <persona> spawns a critic agent wearing a reviewer persona (skeptical-reviewer, domain-expert-reviewer, methods-critic-reviewer all shipped as built-in skills) that reads the draft + underlying experiment manifests and returns structured objections: statistical concerns, missing controls, unsupported claims, related-work gaps, reproducibility red flags. Reuses the existing War Room infrastructure but with research-specific rubrics baked into the critic personas. 
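
The non-Bayesian fallback for test_hypothesis, together with the “contested” and “dangling” states from the Hypothesis Graph, can be sketched as a pure aggregation over manifests. The type and function names are hypothetical, and counting only experiments with status complete is an assumption; the spec does not say how running or abandoned experiments are weighed.

```typescript
// Subset of the manifest fields shown earlier.
interface ExperimentRecord {
  id: string;
  status: "planning" | "running" | "complete" | "abandoned";
  supports: string[];
  refutes: string[];
}

type EvidenceState = "supported" | "refuted" | "contested" | "dangling";

interface EvidenceSummary {
  state: EvidenceState;
  supporting: string[]; // experiment ids
  refuting: string[];
  supportRatio: number | null; // simple experiment-count ratio; null when untested
}

function summarizeEvidence(
  hypothesisId: string,
  experiments: ExperimentRecord[],
): EvidenceSummary {
  const done = experiments.filter((e) => e.status === "complete");
  const supporting = done
    .filter((e) => e.supports.indexOf(hypothesisId) >= 0)
    .map((e) => e.id);
  const refuting = done
    .filter((e) => e.refutes.indexOf(hypothesisId) >= 0)
    .map((e) => e.id);
  const total = supporting.length + refuting.length;

  let state: EvidenceState;
  if (total === 0) state = "dangling"; // stated but never tested
  else if (supporting.length > 0 && refuting.length > 0) state = "contested";
  else state = supporting.length > 0 ? "supported" : "refuted";

  return {
    state,
    supporting,
    refuting,
    supportRatio: total === 0 ? null : supporting.length / total,
  };
}
```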
Statistical validity as a Regression Guard — the validate_statistics check can be registered as a pre-completion guard on the draft_section tool so a paper draft literally cannot be marked done if the underlying experiments don’t clear statistical validity (under-powered n, p-hacking patterns in the parameter sweep, undisclosed multiple comparisons) — composes directly with the Regression Guard Hooks entry in Agent Capabilities. Notebook integration: .ipynb files are first-class experiment artifacts. The agent can execute cells via a Jupyter kernel wrapper tool, capture outputs + figures as proper manifest artifacts, and keep the notebook and any refactored .py module in sync (the Background doc sync entry generalized to code↔notebook). Composition with every earlier entry: Literature synthesis feeds the literature graph and find_related_work; Doc-to-Test Loop verifies the published paper’s claims against the implementation (catches drift between what the paper says and what the code actually does, a common research-integrity hazard); Integrated LaTeX Preview renders the draft with live figures pulled from figures/<id>/; Visualization Dashboards renders the hypothesis graph, experiment timeline, and figure gallery inline; Browser-Agent Visual Verification sanity-checks each generated figure before it’s committed to a draft; Fork & Parallel Solve lets the researcher explore two methodologies in parallel with side-by-side result comparison (the FFT vs wavelet scenario is literally an experiment-fork); Facets give per-domain personas (statistician for validate_statistics, peer_reviewer for review-as, technical_writer for draft_section); Project Knowledge Index indexes the research project so the agent retrieves across past experiments when suggesting new ones; Semantic Time Travel answers “three months ago we thought X about this hypothesis — what experiments changed our mind?”; Regression Guards enforce statistical validity; Shadow Workspaces host 
experiment runs so the main tree never ships with intermediate scratch files; Audit Mode is appropriate for write-heavy drafting sessions. UI surfaces a Research root in the SideCar sidebar with four sub-panels: Projects (list + active project selector), Experiments (timeline view, status badges, quick-reproduce button), Hypothesis Graph (interactive force-directed view), and Drafts (section-per-tab editor with citation previews on hover). A persistent status-bar item shows Research: <project-slug> · 3 exp running · H2 needs evidence so the user sees project state at a glance. Configured via sidecar.research.enabled (default false — opt-in), sidecar.research.projectsPath (default .sidecar/research/), sidecar.research.activeProject (default auto-detects from CWD or most-recently-touched), sidecar.research.reproduceStrictMode (default true — fail on git-SHA / requirements-hash drift during /experiment reproduce; set false for “best-effort reproduce” in exploratory work), sidecar.research.statisticsGuardEnabled (default true — block draft_section on statistical-validity failures), and sidecar.research.reviewerPersonas (default ['skeptical-reviewer', 'domain-expert-reviewer', 'methods-critic-reviewer'] — extendable with custom persona skill IDs).
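The strict-versus-best-effort reproduce behavior controlled by sidecar.research.reproduceStrictMode can be sketched roughly as follows. This is a minimal illustration, not the shipped implementation: the `RecordedEnvironment` shape, the `DriftReport` type, and the `checkReproducibility` helper are hypothetical names, since this spec does not define the manifest schema.

```typescript
// Hypothetical shape: the spec does not define the manifest's environment schema.
interface RecordedEnvironment {
  gitSha: string;
  requirementsHash: string;
}

interface DriftReport {
  ok: boolean;
  drift: string[]; // one human-readable entry per detected mismatch
}

// Compare the environment recorded in a manifest against the current one.
// In strict mode any drift fails the check; otherwise drift is still
// reported but the run may proceed ("best-effort reproduce").
function checkReproducibility(
  recorded: RecordedEnvironment,
  current: RecordedEnvironment,
  strictMode: boolean,
): DriftReport {
  const drift: string[] = [];
  if (recorded.gitSha !== current.gitSha) {
    drift.push(`git SHA drifted: ${recorded.gitSha} -> ${current.gitSha}`);
  }
  if (recorded.requirementsHash !== current.requirementsHash) {
    drift.push("requirements hash does not match the recorded manifest");
  }
  return { ok: drift.length === 0 || !strictMode, drift };
}
```

Returning the drift list even when the check passes keeps the best-effort path honest: the user still sees what differs from the recorded run.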
```
flowchart TD
 subgraph Project [".sidecar/research/&lt;slug&gt;/ (tracked in git)"]
 M[project.yaml title, question, hypotheses]
 E[experiments/&lt;id&gt;/manifest.yaml + artifacts + env + seed]
 L[literature/ Zotero overlays + notes]
 F[figures/&lt;id&gt;/ data + script + rendered]
 D[drafts/ paper, poster, slides]
 O[observations/&lt;ts&gt;.md timestamped notes]
 end
 H[Hypothesis Graph] --> E
 H --> D
 E --> F
 E --> D

 AG[Agent research tools] --> RUN[run_experiment]
 AG --> LO[log_observation]
 AG --> TH[test_hypothesis]
 AG --> FR[find_related_work]
 AG --> SU[suggest_next_experiment]
 AG --> VS[validate_statistics]
 AG --> GF[generate_figure]
 AG --> DS[draft_section]
 AG --> RV[review-as persona]

 RUN --> E
 VS -.Regression Guard.-> DS
 DS --> D
 GF --> F
 FR --> L
 RV --> D

 U[User] --> UI[Research sidebar: Projects · Experiments · Hypothesis Graph · Drafts]
 UI --> AG
```
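How the Hypothesis Graph might classify a hypothesis from the experiments' supports and refutes fields, together with the simple experiment-count ratio that test_hypothesis falls back to when no priors are declared, can be sketched as below. The `ExperimentEvidence` record, the `EvidenceState` labels, and `summarizeEvidence` are assumptions made for illustration; they are a derived view and not the stored status field (open / supported / refuted / needs-more-evidence / abandoned).

```typescript
// Hypothetical record: which hypothesis IDs an experiment supports or refutes.
interface ExperimentEvidence {
  id: string;
  supports: string[];
  refutes: string[];
}

// Derived view used for illustration only.
type EvidenceState = "dangling" | "contested" | "supported" | "refuted";

interface Verdict {
  state: EvidenceState;
  supportRatio: number | null; // null when no experiment has tested the hypothesis
}

function summarizeEvidence(hypothesisId: string, experiments: ExperimentEvidence[]): Verdict {
  const supporting = experiments.filter((e) => e.supports.includes(hypothesisId)).length;
  const refuting = experiments.filter((e) => e.refutes.includes(hypothesisId)).length;
  const total = supporting + refuting;

  // Stated but never tested.
  if (total === 0) return { state: "dangling", supportRatio: null };

  const supportRatio = supporting / total;
  // Experiments both support and refute the hypothesis.
  if (supporting > 0 && refuting > 0) return { state: "contested", supportRatio };
  return { state: supporting > 0 ? "supported" : "refuted", supportRatio };
}
```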
First-Class Jupyter Notebook Support — closes a gap that’s currently zero: SideCar has no notebook awareness at all. read_file on an .ipynb returns raw JSON (unreadable to the model, useless for reasoning); edit_file risks corrupting the JSON schema because the agent can’t see cell boundaries; VS Code’s native vscode.NotebookEdit / NotebookData / NotebookController APIs are unused; there’s no way to run a cell and read its output — which is the whole point of notebooks for the scientific, data, and research workflows the Research Assistant entry above depends on. This entry adds a complete, cell-aware notebook surface built on the native VS Code APIs. Eight new agent tools replace naive text handling of .ipynb files, each dispatching through the native notebook APIs so the underlying JSON schema stays intact and the user’s notebook editor reflects agent edits in real time just like human edits do: (1) read_notebook(path, { includeOutputs?, maxOutputChars? }) returns structured { cells: [{ index, kind: 'code' | 'markdown' | 'raw', language, source, outputs?: NotebookOutput[], metadata }] } — outputs are optional because they balloon context (a single matplotlib plot is ~50k base64 chars), and when included they’re truncated to maxOutputChars per cell with a truncated: true flag; (2) edit_notebook_cell(path, cellIndex, newSource) surgically replaces one cell’s source without touching surrounding cells, outputs, or metadata — routed through vscode.NotebookEdit.updateCellText; (3) insert_notebook_cell(path, atIndex, source, kind, language?) creates a new cell at a specific position via NotebookEdit.insertCells; (4) delete_notebook_cell(path, cellIndex) removes a cell cleanly via NotebookEdit.deleteCells; (5) reorder_notebook_cells(path, [newOrder]) shuffles cells (useful when refactoring exploration notebooks into linear presentation order); (6) run_notebook_cell(path, cellIndex, { timeoutMs? 
}) executes a cell via the notebook’s attached NotebookController and returns structured outputs — text, tables, base64 images (auto-piped to Visual Verification when that feature is enabled and the cell produces a plot), stderr, execution count, elapsed time, and a kernelError? field with stack trace when execution fails; (7) run_notebook_all(path, { stopOnError?, maxCellMs? }) executes every code cell in order, streaming progress back to the agent as each completes so long-running notebooks don’t block on a single response; (8) generate_notebook(path, { outline, template?, kernel? }) creates a new .ipynb from scratch with scaffolded cells — built-in templates ship for common shapes (data-exploration, signal-processing-analysis, paper-figure-reproduction, experiment-sweep, tutorial-walkthrough), and the outline can be a free-form list of cell descriptions the model fills in. Roundtrip fidelity is a hard invariant: reading a notebook → making an edit → writing it back preserves cell IDs, execution counts, cell metadata, kernel specs, language info, and (when the user didn’t ask for output changes) every existing output byte-for-byte. Enforced with a unit-level property test — a fuzzing harness that reads → no-op edits → writes 500 realistic notebooks and asserts byte equality. Catches the classic AI-assistant-corrupts-my-notebook failure mode before it ships. Cell-aware streaming diff previews extend the existing streamingDiffPreviewFn so a multi-cell edit shows each cell’s diff in its own collapsible tile in the Pending Changes panel, not a single monolithic JSON-level diff (which is what the current raw-file path produces and which is useless for reviewing). Inserts / deletes / reorders get their own visual treatment so the user sees structural changes distinctly from content changes. 
Kernel handling: the agent respects the notebook’s attached kernel — if the user already selected “Python 3.11 (venv)”, agent tool calls execute there; no kernel attached triggers a one-time prompt via the existing approval system (“no kernel attached, select one or install the recommended ipykernel in .venv?”). Multi-language notebooks (Jupyter supports them) work — each cell’s declared language drives which kernel subprocess handles it. Execution outputs cap at sidecar.notebooks.maxOutputChars (default 2000) per cell for the returned-to-agent view; the full output always persists in the notebook file regardless — truncation is for the agent’s working context, not for durable state. Output-to-Visual-Verification bridge: when run_notebook_cell produces a base64 image output and sidecar.visualVerify.enabled is true, the image auto-flows into the Visual Verification pipeline (cheap checks for blank/clipped/axes-missing, optional VLM for criterion-matching) without the agent having to manually invoke analyze_screenshot — so a matplotlib plot in a research notebook gets the same vision-guided correctness loop that the Browser-Agent entry describes for web preview. Merge-conflict handling: .ipynb merges are notoriously bad in git because the JSON format serializes outputs, execution counts, and cell IDs into the diff. This entry doesn’t solve git-level merging (out of scope) but does make SideCar’s own conflict view cell-aware: when the Audit Mode treeview or Pending Changes panel detects a buffered notebook write colliding with an on-disk change, the three-way merge editor opens at the cell granularity rather than the JSON-line granularity. 
Integration with every earlier entry: Research Assistant treats .ipynb as a first-class experiment artifact — run_notebook_all on an experiment manifest’s notebook is the canonical reproduce path; Browser-Agent Visual Verification auto-hooks cell plot outputs; Regression Guards can register trigger: post-write with command: jupyter nbconvert --execute --to notebook --inplace to enforce that every notebook edit keeps the notebook runnable; Doc-to-Test Loop can synthesize .ipynb tests from paper figures (generated cells that reproduce each figure get faithfulness-checked); Fork & Parallel Solve lets each fork contain its own notebook variant for side-by-side methodology comparison; Merkle Index chunks notebooks at the cell level (each cell is its own Merkle leaf, so a one-cell edit re-hashes one leaf not the whole notebook); Project Knowledge Index’s symbol extractor recognizes notebook cells as first-class chunks alongside TS/Python functions; Shadow Workspaces run notebooks in the shadow kernel so the main tree’s cached outputs aren’t perturbed during iteration; Audit Mode’s treeview shows per-cell diffs for buffered notebook writes. Built-in code↔notebook sync (the feature mentioned in Research Assistant): when a .py module and a sibling .ipynb both declare a symbol (function, class), the agent keeps them in step — edits to the .py module prompt the agent to update the corresponding .ipynb cell and vice versa, with conflicts surfaced as a three-way merge. Configured via sidecar.codeNotebookSync.pairs (array of { module, notebook } path pairs); absent = no-op. 
Configured via sidecar.notebooks.enabled (default true once a notebook is opened or created in the workspace), sidecar.notebooks.includeOutputsInRead (default false — outputs bloat context; agent asks explicitly when needed), sidecar.notebooks.maxOutputChars (default 2000), sidecar.notebooks.autoExecuteOnEdit (default false — agent edits don’t auto-run cells; explicit /run or run_notebook_cell is required), sidecar.notebooks.visualizeOutputsInVLM (default true when Visual Verification is enabled), sidecar.notebooks.cellGranularDiff (default true — cell-tile view; false falls back to raw JSON diff for debugging), and sidecar.notebooks.templates (array of template paths for generate_notebook beyond the built-ins).
```
flowchart TD
 A[Agent] --> T{Notebook tool}
 T --> RN[read_notebook structured cells + optional outputs]
 T --> EN[edit_notebook_cell via NotebookEdit.updateCellText]
 T --> IN[insert_notebook_cell via NotebookEdit.insertCells]
 T --> DN[delete_notebook_cell via NotebookEdit.deleteCells]
 T --> RO[reorder_notebook_cells explicit new cell order]
 T --> RC[run_notebook_cell via NotebookController.executeHandler]
 T --> RA[run_notebook_all streaming per-cell progress]
 T --> GN[generate_notebook templates + outline]

 EN & IN & DN --> WE[workspace.applyEdit WorkspaceEdit with NotebookEdit entries]
 WE --> IPY[.ipynb on disk]
 WE --> CELL_DIFF[Cell-granular diff in Pending Changes]

 RC --> OUT{Output kind}
 OUT -->|text / table| TXT[Back to agent, truncated to maxOutputChars]
 OUT -->|image base64| VV{visualVerify enabled?}
 VV -->|yes| VVP[auto-flow into Visual Verification pipeline]
 VV -->|no| TXT
 OUT -->|kernelError| ERR[Structured error + stack trace to agent]

 GN --> TPL[Built-in templates: data-exploration / signal-processing-analysis / paper-figure-reproduction / experiment-sweep / tutorial-walkthrough]

 subgraph Invariants
 FID[Roundtrip fidelity: read → no-op edit → write = byte-equal property-tested]
 end
```
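The output routing in the diagram above (truncate text to maxOutputChars, hand images to Visual Verification when enabled, pass kernel errors through in structured form) could look roughly like this. The `NotebookOutput` and `AgentView` unions and the `routeOutput` function are hypothetical simplifications, not VS Code API types.

```typescript
// Hypothetical simplified output shapes; not the VS Code notebook API types.
type NotebookOutput =
  | { kind: "text"; text: string }
  | { kind: "image"; base64: string }
  | { kind: "error"; message: string; stack: string };

type AgentView =
  | { kind: "text"; text: string; truncated: boolean }
  | { kind: "visual-verification"; base64: string }
  | { kind: "error"; message: string; stack: string };

interface RouteOptions {
  maxOutputChars: number;
  visualVerifyEnabled: boolean;
}

// Truncation only affects the agent's view; the notebook file keeps the full output.
function truncate(text: string, maxChars: number): { text: string; truncated: boolean } {
  if (text.length <= maxChars) return { text, truncated: false };
  return { text: text.slice(0, maxChars), truncated: true };
}

function routeOutput(output: NotebookOutput, opts: RouteOptions): AgentView {
  switch (output.kind) {
    case "text":
      return { kind: "text", ...truncate(output.text, opts.maxOutputChars) };
    case "image":
      // With Visual Verification off, the diagram falls back to the truncated text path.
      return opts.visualVerifyEnabled
        ? { kind: "visual-verification", base64: output.base64 }
        : { kind: "text", ...truncate(output.base64, opts.maxOutputChars) };
    case "error":
      return { kind: "error", message: output.message, stack: output.stack };
  }
}
```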

Multi-Agent

Worktree-isolated agents — each agent in its own git worktree
Agent dashboard — visual panel for running/completed agents
Multi-agent task coordination — parallel agents with dependency layer
Remote headless hand-off — detach tasks to run on a remote server via @sidecar/headless CLI
Multi-agent War Room — a red-team review layer that runs before output ever reaches the user. A lead Critic Agent adversarially challenges the coding agent’s solution (logic, security, edge cases, architecture), the coding agent rebuts and revises, and the exchange continues for a configurable number of rounds until the critic is satisfied or escalates to the user. The full debate is streamed live in a dedicated War Room sidebar panel so you can watch the agents argue in real time. Builds on the existing runCriticChecks / HookBus infrastructure — the critic becomes a first-class peer agent rather than a post-turn annotation pass. Configurable via sidecar.warRoom.enabled, sidecar.warRoom.rounds (default: 2), and sidecar.warRoom.model (can point to a different, cheaper model for the critic role).
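The War Room's bounded critique loop can be sketched as below. This is only an illustration under stated assumptions: `Critic`, `Reviser`, and `runWarRoom` are hypothetical names, the callbacks are synchronous for brevity (real agent calls would be asynchronous), and escalating to the user once the rounds are exhausted is one plausible reading of "until the critic is satisfied or escalates to the user".

```typescript
// Hypothetical callback shapes standing in for the critic and coding agents.
interface CriticReview {
  satisfied: boolean;
  objections: string[];
}
type Critic = (solution: string) => CriticReview;
type Reviser = (solution: string, objections: string[]) => string;

interface WarRoomOutcome {
  solution: string;
  roundsUsed: number;
  escalated: boolean; // true when the critic never signed off
}

// rounds mirrors sidecar.warRoom.rounds (default 2).
function runWarRoom(initial: string, critic: Critic, revise: Reviser, rounds = 2): WarRoomOutcome {
  let solution = initial;
  for (let round = 1; round <= rounds; round++) {
    const review = critic(solution);
    if (review.satisfied) {
      return { solution, roundsUsed: round, escalated: false };
    }
    // The coding agent rebuts and revises in response to the objections.
    solution = revise(solution, review.objections);
  }
  // Rounds exhausted without sign-off: hand the decision to the user.
  return { solution, roundsUsed: rounds, escalated: true };
}
```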

User Experience

Integrated LaTeX Preview & Compilation — a first-class technical writing workflow built on top of the agent tool system. The agent gains a write_latex tool that creates and edits .tex files with full awareness of document structure (preamble, environments, bibliography). A background compilation watcher runs latexmk (or tectonic as a zero-config fallback) on every save, parses the log for errors and undefined citations, and surfaces them as inline diagnostics in the editor. A Ghost Preview panel opens beside the source and renders the compiled PDF (or a KaTeX/MathJax live render of the current math block when a full compile is pending), giving a true side-by-side experience without leaving VS Code. Bibliography integrity is checked separately — missing \cite{} keys and malformed .bib entries are flagged before the compile even runs. Configurable via sidecar.latex.compiler (latexmk | tectonic), sidecar.latex.ghostPreview.enabled, and sidecar.latex.bibCheck.enabled.
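The missing-key half of the bibliography check described above could be approximated with two regular expressions, as in this sketch. The function names (`extractCiteKeys`, `extractBibKeys`, `findMissingCitations`) are hypothetical, the patterns cover only common \cite-style commands and simple @type{key, ...} entries, and detecting malformed .bib entries is out of scope here.

```typescript
// Collect keys from \cite{a,b}, \citep{...}, \citet[p.~3]{...} and similar commands.
function extractCiteKeys(tex: string): string[] {
  const keys: string[] = [];
  const citeRe = /\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}/g;
  let match: RegExpExecArray | null;
  while ((match = citeRe.exec(tex)) !== null) {
    for (const raw of match[1].split(",")) {
      const key = raw.trim();
      if (key.length > 0 && keys.indexOf(key) === -1) keys.push(key);
    }
  }
  return keys;
}

// Collect entry keys from a .bib source, skipping non-entry blocks.
function extractBibKeys(bib: string): string[] {
  const keys: string[] = [];
  const entryRe = /@(\w+)\s*\{\s*([^,\s]+)\s*,/g;
  let match: RegExpExecArray | null;
  while ((match = entryRe.exec(bib)) !== null) {
    const entryType = match[1].toLowerCase();
    if (entryType === "comment" || entryType === "string" || entryType === "preamble") continue;
    keys.push(match[2]);
  }
  return keys;
}

// Keys cited in the .tex source that no .bib entry defines.
function findMissingCitations(tex: string, bib: string): string[] {
  const defined = extractBibKeys(bib);
  return extractCiteKeys(tex).filter((key) => defined.indexOf(key) === -1);
}
```

Running this before latexmk or tectonic is what lets undefined citations surface without waiting for a full compile.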

Background doc sync — silently update README/JSDoc/Swagger when function signatures change (2/3 shipped: JSDoc staleness diagnostics flag orphan/missing @param tags with quick fixes; README sync flags stale call arity in fenced code blocks with rewrite quick fixes. Swagger deferred — framework-specific, no in-repo OpenAPI spec to dogfood against; will revisit when a real use case lands.)
Zen mode context filtering — /focus <module> to restrict context to one directory

Suggestion Mode — inverted-default approvals (flow-preserving UX) — a fundamental reframing of the tool-dispatch UX from “we’ll run it unless you stop us” to “here’s what I’d do, click to apply.” Today approvals in cautious mode (default) interrupt the developer’s flow: destructive tools pop a native modal (chatState.ts:242-250) and non-destructive ones render an inline confirm card (chatState.ts:255-259) the user must dismiss before the agent proceeds. Even inline cards are blocking from the agent’s POV — confirmFn awaits the promise before executeTool returns. Both surfaces assume a binary accept/reject and force a context switch from writing-code-alongside-the-agent to reviewing-an-interrupt. The entire toolPermissions: 'allow' | 'deny' | 'ask' axis (executor.ts:252-256) is static — there’s no “remember my choice for this session” affordance and no way to convert the interrupt into a non-blocking preview.

The flip: a new approval style sidecar.approvals.style: 'modal' | 'inline' | 'suggestion' (default stays inline to preserve existing behavior; users opt into suggestion when ready). In suggestion mode, a would-be tool call doesn’t pause the agent — it materializes as a preview card in the chat transcript with the full payload visible (diff for write_file/edit_file, command text for run_command, search query for grep, etc.) and a one-click Apply / Skip / Edit & apply affordance. The agent’s call returns synthetically as suggested rather than executed, so the loop keeps moving: the next iteration sees a tool result like "Suggested write_file:src/auth.ts — user has not applied yet" and reasons accordingly (it might ask the user in text, move on to independent work, or queue a dependent call that flips to pending-apply until the user acts). Nothing blocks; the developer scrolls through suggestions at their own pace, applying in order or out of order. This inverts the trust model: instead of the user being the brake on an agent sprinting forward, the user is the throttle gating each action in — closer to how Copilot Edits, Cursor’s Agent mode, and Continue.dev’s accept-per-hunk flow treat high-autonomy edits.

Why this solves the specific pain — the current UX problem isn’t the existence of approvals (security and trust depend on them) but the shape of the interrupt. A 20-file refactor currently fires 20 inline cards, each blocking until dismissed; the developer can’t keep writing code in another file while waiting because the agent is paused too. In suggestion mode, all 20 fire as non-blocking cards, the agent continues reasoning (producing downstream suggestions that depend on earlier ones as pending-apply), and the developer drains the queue at their own cadence — or applies all at once from a panel summary. Multi-File Edit Streams (v0.65) already plans edits as a DAG; suggestion mode naturally pairs with that, showing the Planned Edits card with per-edit Apply buttons instead of running writes behind the user’s back.

Mechanism and infrastructure changes required:

New SuggestionStore — process-wide singleton holding SuggestedAction { id, tool, input, rationale, createdAt, status: 'pending' | 'applied' | 'skipped' | 'edited', dependsOnIds: string[] }. The executor’s approval gate (executor.ts:303-401) branches on config.approvals.style === 'suggestion': instead of calling confirmFn, it pushes a SuggestedAction into the store and returns a synthetic ToolResultContentBlock with is_error: false and a structured payload the agent can reason over ({ status: 'suggested', suggestionId, summary }).
Webview protocol extension — new outgoing commands suggestionCreated, suggestionApplied, suggestionSkipped, suggestionEdited; new incoming commands applySuggestion, skipSuggestion, editSuggestion. Carries the full tool input so the preview can render syntax-highlighted content, a unified diff (for file writes via the existing streamingDiffPreview renderer), or a command transcript (for run_command).
Chat UI tile per suggestion — styled like the Planned Edits card (v0.65 chunk 4.4a) with theme-token badges per tool type, a path / command summary line, expandable full-payload details, and three buttons: Apply (executes via executeOneToolUse with the original context), Skip (marks status: 'skipped', surfaces as a “not applied” tool_result on the next turn so the agent knows), Edit & apply (opens the tool input in a modal editor — tweak the shell command, adjust file content, rewrite the grep pattern — then apply the modified version; applied suggestions carry an edited: true flag the agent sees). Inline keyboard shortcuts: ⌘⏎ applies, Escape skips, e edits.
Dependency tracking — when a suggestion’s input references a path another pending suggestion would create or modify, we mark dependsOnIds. The UI badges dependent suggestions as awaiting-parent and greys the Apply button until prerequisites land, preventing the “apply a suggestion that edits a file that doesn’t exist yet” footgun.
Bulk actions on the summary panel — a persistent Pending Suggestions (N) strip above the chat input (reusing the steer-queue-strip layout from v0.65 chunk 3.3): Apply all (topologically), Skip all, Apply file-writes only (for when you trust edits but want to review shell commands individually). Each bulk action confirms once with a modal rather than firing N modals.
Session-scoped “auto-apply” affordance — a checkbox on each suggestion: “Auto-apply future write_file on src/auth/**” converts that pattern into a session-scoped allowlist so repeated identical suggestions on the same surface auto-apply. Decays at session end (not a persistent setting — opposite failure mode from a global quiet-mode switch where users forget it’s on). Backed by a new SessionAllowlist interface on ChatState that the approval gate consults before creating a suggestion.
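The store and dependency gating listed above can be sketched in a few dozen lines. The field names follow the SuggestedAction shape given in this spec; everything else (the method names, treating 'edited' as "applied with user edits", and leaving dependents of skipped suggestions pending) is an assumption made for illustration, and actual tool execution is omitted.

```typescript
type SuggestionStatus = "pending" | "applied" | "skipped" | "edited";

// Field names follow the SuggestedAction shape described in the spec.
interface SuggestedAction {
  id: string;
  tool: string;
  input: unknown;
  rationale: string;
  createdAt: number;
  status: SuggestionStatus;
  dependsOnIds: string[];
}

class SuggestionStore {
  private readonly actions = new Map<string, SuggestedAction>();

  // Record a would-be tool call as a pending suggestion instead of executing it.
  add(id: string, tool: string, input: unknown, rationale: string, dependsOnIds: string[] = []): SuggestedAction {
    const action: SuggestedAction = {
      id, tool, input, rationale, createdAt: Date.now(), status: "pending", dependsOnIds,
    };
    this.actions.set(id, action);
    return action;
  }

  // Applicable only while pending and once every prerequisite has landed.
  // Assumption: "edited" means "applied with user edits", so it counts as landed.
  canApply(id: string): boolean {
    const action = this.actions.get(id);
    if (action === undefined || action.status !== "pending") return false;
    return action.dependsOnIds.every((depId) => {
      const dep = this.actions.get(depId);
      return dep !== undefined && (dep.status === "applied" || dep.status === "edited");
    });
  }

  skip(id: string): void {
    const action = this.actions.get(id);
    if (action !== undefined && action.status === "pending") action.status = "skipped";
  }

  // "Apply all (topologically)": sweep until no further suggestion becomes applicable.
  applyAll(): string[] {
    const appliedIds: string[] = [];
    let progressed = true;
    while (progressed) {
      progressed = false;
      for (const action of Array.from(this.actions.values())) {
        if (this.canApply(action.id)) {
          action.status = "applied"; // real code would execute the tool here
          appliedIds.push(action.id);
          progressed = true;
        }
      }
    }
    return appliedIds;
  }
}
```

The same canApply check is what would grey out the Apply button for awaiting-parent suggestions in the UI.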

What stays blocking: suggestion mode is opt-out-able per tool via sidecar.approvals.alwaysConfirm: string[] (default ['run_command', 'git_push', 'delete_file']). Truly destructive ops still fire the existing native-modal path because the cost of an “oops I clicked Apply by accident” on rm -rf is not recoverable. The NATIVE_MODAL_APPROVAL_TOOLS list (chatState.ts:242) becomes the default for alwaysConfirm and users can tighten or loosen it per taste. Suggestion mode is for the common case of file edits + reads + searches, which is where the flow-breaking accumulates; the truly destructive gate stays in place.
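One way to picture how the approval gate might combine sidecar.approvals.style, alwaysConfirm, and the session allowlist is the decision function below. It is a sketch under assumptions: `decideGate`, `GateDecision`, and the prefix-based `AllowlistEntry` (a pattern such as src/auth/** reduced to the prefix src/auth/) are hypothetical, and the precedence shown (alwaysConfirm wins over everything, including the allowlist) is an inference from the text, not a stated rule.

```typescript
type ApprovalStyle = "modal" | "inline" | "suggestion";
type GateDecision = "native-modal" | "inline-confirm" | "suggest" | "auto-apply";

interface GateConfig {
  style: ApprovalStyle;
  alwaysConfirm: string[];
}

// Hypothetical simplification: "src/auth/**" is stored as the prefix "src/auth/".
interface AllowlistEntry {
  tool: string;
  pathPrefix: string;
}

function decideGate(
  tool: string,
  path: string | undefined,
  config: GateConfig,
  sessionAllowlist: AllowlistEntry[],
): GateDecision {
  // Truly destructive tools keep the blocking native-modal path in every style.
  if (config.alwaysConfirm.includes(tool)) return "native-modal";
  if (config.style === "modal") return "native-modal";
  if (config.style === "inline") return "inline-confirm";

  // Suggestion style: consult the session allowlist before creating a suggestion.
  const allowed =
    path !== undefined &&
    sessionAllowlist.some((entry) => entry.tool === tool && path.startsWith(entry.pathPrefix));
  return allowed ? "auto-apply" : "suggest";
}
```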

Integration with every earlier entry: Multi-File Edit Streams (v0.65) renders its Planned Edits card’s per-file entries as suggestions natively — each DAG node becomes a SuggestedAction and the existing dependency layering maps 1:1 to the suggestion store’s dependsOnIds. Steer Queue (v0.65) remains the mid-run course-correct channel — a steer queued while suggestions are pending can say “skip the src/legacy/** ones” and the summary strip honors that. Shadow Workspaces stay compatible — applying a suggestion in suggestion mode routes through executeOneToolUse which honors cwdOverride, so approved suggestions land in the shadow tree exactly as today’s approved writes do. Audit Mode becomes redundant for write_file in suggestion mode (the SuggestionStore IS the buffer; the user reviews + applies directly) but stays relevant for run_command and other non-write tools. Regression Guards fire against the applied set, not the suggested set — if the user skips half, guards only see what landed. Fork & Parallel Solve shows each fork’s suggestions in its own column of the Fork Review panel.

Phased rollout: Phase 1 ships style: 'suggestion' behind an opt-in flag with the SuggestionStore, webview tiles, and basic Apply/Skip — no editing, no dependency tracking, no bulk actions. Phase 2 adds Edit & apply, dependsOnIds, and bulk actions. Phase 3 adds session-scoped auto-apply patterns and per-tool alwaysConfirm tuning. The default remains inline through all three phases; suggestion becomes the default only after one release cycle of telemetry-backed validation shows that Apply/Skip/Edit rates match the “non-blocking wins” hypothesis (users apply >80% of file-write suggestions with <5% rework).

Configured via sidecar.approvals.style (modal | inline | suggestion, default inline), sidecar.approvals.alwaysConfirm (string[], default ['run_command', 'git_push', 'delete_file']), sidecar.approvals.autoApplyPatterns (session-scoped — UI-driven, not persisted; shown here for discoverability), sidecar.approvals.showDependencyEdges (default true), and sidecar.approvals.bulkConfirmThreshold (default 5 — above this many suggestions, Apply all requires one confirm click rather than silently running).

Dependency drift alerts — real-time feedback on bundle size, vulnerabilities, and duplicates when deps change

Observability

RAG-Native Eval Metrics (RAGAs) + Qualitative LLM-as-Judge (G-Eval) — reopens the LLM-as-judge scoring deferral from v0.50 (documented at ROADMAP.md under Eval harness gaps: “deterministic predicates give crisper regression signal than a second-model scoring hop, so this was intentionally skipped… reopen if we start shipping features where correctness is fuzzy rather than binary”). The deferral holds up for the features that existed at v0.50 — tool-trajectory assertions, file-state substring matches, mustContain/mustNotContain predicates on final output were the right call. But the features added since and pending across this ROADMAP (Project Knowledge Index with graph-fusion retrieval, Merkle-addressed fingerprints, Fork & Parallel Solve with its Judge mode, Doc-to-Test constraint extraction, Browser-Agent Visual Verification, Thinking Visualization modes) all have correctness surfaces that are fuzzy — retrieval quality, answer faithfulness, reasoning coherence, visual-check calibration — and trying to keep these honest with only deterministic predicates leaves a regression blind spot. This entry extends the existing tests/llm-eval/ harness with two complementary metric layers, kept additive: deterministic predicates still gate on mustContain and tool trajectories (cheap, reliable, first line of defense); fuzzy metrics layer on top as optional per-case expectations the CI also gates on. Layer 1 — RAGAs metrics for retrieval-augmented features (Project Knowledge Index, monorepo cross-repo search, Literature synthesis, Memory Guardrails): four core scorers implemented as JS-native LLM-as-judge calls, not a Python subprocess dependency on the ragas package — the metrics are simple enough to reimplement cleanly (each is a prompt + a parser), and the VS Code extension shouldn’t drag Python into its deployment story. (1) Faithfulness — does the generated answer only claim things supported by retrieved context? 
Judge decomposes the answer into atomic claims, then for each claim asks “is this entailed by the retrieved context?”; score = entailed_claims / total_claims. Catches hallucination where the agent invents facts not in retrieved docs. (2) Answer Relevancy — does the answer actually address the user’s question? Judge generates N alternative questions the answer would have correctly responded to, compares their embeddings to the original question’s, scores by mean cosine similarity. Catches off-topic drift. (3) Context Precision — did retrieval rank relevant chunks higher than irrelevant ones? Judge rates each returned chunk as relevant / irrelevant to the ground-truth answer, then computes mean reciprocal rank weighted by relevance. Catches “the right file was in position 8 but position 1 was a red herring” regressions that a flat “was the right file retrieved?” metric misses. (4) Context Recall — did retrieval find all the chunks needed for the ground-truth answer? Judge decomposes the ground truth into atomic claims, for each asks “is there a retrieved chunk that supports this?”; score = supported_gt_claims / total_gt_claims. Catches missing-needle failures that Context Precision alone can’t detect. Cases declare these via a new rag expectations block: expect: { rag: { faithfulness: { min: 0.85 }, contextPrecision: { min: 0.7 }, contextRecall: { min: 0.8 } } }. Layer 2 — G-Eval qualitative scoring for fuzzy output aspects (coherence, correctness on ambiguous tasks, style, custom criteria) implemented as a generic LLM-as-judge scorer with a common chain-of-thought template inspired by DeepEval’s G-Eval — again re-implemented in TS rather than shelled out to the Python package. Each G-Eval scorer takes a name, a description of what’s being measured, and a 1-N rating scale; the judge generates a CoT reasoning trace, then emits a numeric score with justification. 
Built-in criteria ship pre-tuned: coherence (does the response follow a logical structure?), correctness (given the task description, is the output free of errors?), relevance (does it address what was asked?), fluency (well-formed prose), actionability (can the user act on the answer without clarification?); custom criteria are user-declarable via sidecar.eval.gEvalCriteria with a name, description, and scale. Used by cases as expect: { gEval: { coherence: { min: 7 }, correctness: { min: 8 } } }. Judge’s full reasoning is captured in the eval report so regressions come with why they’re regressions, not just “score dropped 0.4 → 0.3.” Shared LLM-as-judge primitive backs both layers at tests/llm-eval/scorers/llmJudge.ts — a single dispatch point that handles judge-model routing (via Model Routing rules’ judge role so cheap-judge vs gold-judge is configurable), caches results aggressively to .sidecar/cache/eval-judge/ keyed by (judgeModel, promptHash, inputHash) so re-running the suite against unchanged inputs is free, and supports cheap-judge-first / gold-judge-on-borderline for cost control: run Haiku on every case, escalate to Sonnet only when Haiku’s score is near the pass threshold (within a configurable margin) so close calls get the better judge but clear passes/fails don’t burn the budget. Ground-truth curation workflow: RAGAs recall requires ground-truth answers, which the current harness doesn’t collect. A new tests/llm-eval/ground-truth/ directory stores per-case ground truths as markdown + YAML frontmatter ({ answer: "...", supportingFacts: [...], requiredContext: [...] }); a /curate-ground-truth CLI walks uncurated cases, generates draft ground truths via the judge model, and surfaces them in a review UI where the human edits and commits. 
The workflow is explicit about provenance: ground truths carry a curator: human | model | model-reviewed tag in frontmatter so eval reports can flag metrics computed against unreviewed model-generated truths as tentative rather than authoritative. Regression tracking surface: eval report output extends the existing text summary with per-metric trend data (faithfulness: 0.87 (↓ 0.03 from prev)) and a CI-friendly tests/llm-eval/history.jsonl append-only log of each run’s metrics keyed by git SHA, so npm run eval:report can render a 30-day chart showing whether retrieval precision is drifting as the Merkle index changes, faithfulness is regressing as prompts evolve, or coherence is degrading on cheaper-model runs. Cost controls: sidecar.eval.judgeBudgetPerRun (default $1.00 USD equivalent — a full RAG+G-Eval suite with Haiku-judge costs ~$0.10–0.30 typically, so this is conservative); exceeding the budget skips the remaining fuzzy scorers with a visible warning rather than billing-surprising the user. Deterministic scorers always run — they’re free. Composes with every earlier retrieval entry: Project Knowledge Index acceptance criteria become concrete RAGAs thresholds (context precision must not regress after symbol-chunking migration); Merkle fingerprint stability becomes a test (same root → identical retrieval output → identical RAG scores, which is a stronger regression signal than per-feature tests); Fork & Parallel Solve’s built-in Judge mode reuses the same llmJudge primitive so its in-runtime scoring is consistent with the offline eval scoring; Doc-to-Test Loop’s synthesized tests get faithfulness-checked against the source doc; Visual Verification’s VLM verdicts get a coherence check via G-Eval. 
Configured via sidecar.eval.ragMetrics (array of enabled RAGAs scorers, default ['faithfulness', 'answerRelevancy', 'contextPrecision', 'contextRecall']), sidecar.eval.gEvalCriteria (record of name → { description, scale: [1, N] } for custom criteria beyond the built-ins), sidecar.eval.judgeBudgetPerRun (default 1.00), sidecar.eval.cheapJudgeModel (default inherits from Model Routing judge role), sidecar.eval.goldJudgeModel (default empty — disables gold escalation if unset), sidecar.eval.goldJudgeMargin (default 0.1 — escalate to gold when cheap-judge score is within this margin of the threshold), and sidecar.eval.cacheDir (default .sidecar/cache/eval-judge/, covered by the gitignored-subdirs carve-out).
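The two ratio-based RAGAs scores (faithfulness and context recall) and the per-case `min` thresholds reduce to simple arithmetic once the judge has labeled each atomic claim. The sketch below assumes the judge's verdicts are already available; `ClaimVerdict`, `supportedFraction`, and `failedMetrics` are hypothetical names, and scoring an empty claim list as 0 is an arbitrary choice for illustration.

```typescript
// Hypothetical: one judge verdict per atomic claim.
interface ClaimVerdict {
  claim: string;
  supported: boolean;
}

// Fraction of claims the judge marked as supported. The same ratio serves
// faithfulness (answer claims entailed by retrieved context) and context
// recall (ground-truth claims supported by some retrieved chunk).
function supportedFraction(verdicts: ClaimVerdict[]): number {
  if (verdicts.length === 0) return 0; // assumption: no claims scores as 0
  return verdicts.filter((v) => v.supported).length / verdicts.length;
}

type RagExpectations = Record<string, { min: number }>;

// Names of metrics that are missing or fall below their declared minimum.
function failedMetrics(scores: Record<string, number>, expectations: RagExpectations): string[] {
  return Object.keys(expectations).filter((name) => {
    const score = scores[name];
    return score === undefined || score < expectations[name].min;
  });
}
```

For example, an answer with four atomic claims of which three are entailed scores 0.75 on faithfulness and would fail a `faithfulness: { min: 0.85 }` expectation.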
```
flowchart TD
 CASE[Eval case with expect: mustContain + rag + gEval blocks] --> RUN[Run SideCar agent on input]
 RUN --> OUT[Final output + retrieved context + tool trajectory]
 OUT --> DET[Deterministic scorers mustContain, trajectory, file-state]
 OUT --> RAG{RAGAs scorers}
 OUT --> GEV{G-Eval scorers}
 RAG --> FA[Faithfulness: atomic claims vs context]
 RAG --> AR[Answer relevancy: generated questions ≈ input]
 RAG --> CP[Context precision: weighted MRR]
 RAG --> CR[Context recall vs ground truth]
 GEV --> COH[Coherence 1-10]
 GEV --> COR[Correctness 1-10]
 GEV --> CUSTOM[User criteria]
 FA & AR & CP & CR & COH & COR & CUSTOM --> JUDGE[LLM-as-judge cheap first → gold on borderline]
 JUDGE --> CACHE[(.sidecar/cache/eval-judge/ judgeModel + promptHash)]
 DET & JUDGE --> AGG[Aggregate result]
 AGG --> HIST[Append to history.jsonl by SHA]
 HIST --> REPORT[Trend report per-metric deltas + judge reasoning traces]
```
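The cheap-first, gold-on-borderline judging and the per-run budget cap described above can be sketched roughly as follows. This is an illustrative sketch, not the shipped `llmJudge` primitive: `JudgeFn`, `JudgeConfig`, `Outcome`, and `scoreWithEscalation` are hypothetical names, and judges are modeled as synchronous functions for brevity. Only the setting semantics (`goldJudgeMargin`, `judgeBudgetPerRun`, gold escalation disabled when `goldJudgeModel` is unset) come from the spec.

```typescript
// Hypothetical shapes; the spec names the settings, not these types.
interface JudgeResult { score: number; costUsd: number }
type JudgeFn = (prompt: string) => JudgeResult;

interface JudgeConfig {
  threshold: number;         // pass/fail cut-off for the metric being scored
  goldJudgeMargin: number;   // sidecar.eval.goldJudgeMargin (default 0.1)
  judgeBudgetPerRun: number; // sidecar.eval.judgeBudgetPerRun (default 1.00)
}

interface Outcome {
  score: number | null;      // null when the scorer was skipped
  judge: "cheap" | "gold" | "skipped";
  spentUsd: number;          // running total for the eval run
}

function scoreWithEscalation(
  prompt: string,
  cheap: JudgeFn,
  gold: JudgeFn | undefined, // undefined models an unset goldJudgeModel
  cfg: JudgeConfig,
  spentSoFarUsd: number,
): Outcome {
  // Budget already exhausted: skip the fuzzy scorer instead of billing more.
  if (spentSoFarUsd >= cfg.judgeBudgetPerRun) {
    return { score: null, judge: "skipped", spentUsd: spentSoFarUsd };
  }
  const first = cheap(prompt);
  let spent = spentSoFarUsd + first.costUsd;

  // Escalate only when the cheap score lands within the margin of the threshold.
  const borderline = Math.abs(first.score - cfg.threshold) <= cfg.goldJudgeMargin;
  if (borderline && gold !== undefined && spent < cfg.judgeBudgetPerRun) {
    const second = gold(prompt);
    spent += second.costUsd;
    return { score: second.score, judge: "gold", spentUsd: spent };
  }
  return { score: first.score, judge: "cheap", spentUsd: spent };
}
```

In this sketch a clear pass or clear fail never pays for the gold judge, so the gold model's cost is only incurred on the cases where its extra accuracy can change the verdict.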
Model comparison / Arena mode — side-by-side prompt comparison with voting
Role-Based Model Routing & Hot-Swap — replaces SideCar’s current scatter of per-role model settings (sidecar.model, sidecar.completionModel, sidecar.critic.model, sidecar.delegateTask.workerModel, sidecar.fallbackModel, and the plannerModel / judgeModel / vlm knobs added in other roadmap entries) with a unified, declarative rule set that routes each dispatch to the right model for its actual job — so you can run Llama 3 for free local chat, promote to Claude Sonnet/Opus for the high-reasoning agent loop, and drop to Haiku for cheap summarization, all in one coherent config. The target experience: ultra-pro intelligence exactly where it earns its keep (the multi-turn agent loop, the War Room critic, the planner pass before a wide refactor) with the rest of the session staying free and local. Rule shape:
```
"sidecar.modelRouting.rules": [
 // First match wins — list most specific first.
 { "when": "agent-loop.complexity=high", "model": "claude-opus-4-6" },
 { "when": "agent-loop", "model": "claude-sonnet-4-6" },
 { "when": "chat", "model": "ollama/llama3:70b" },
 { "when": "completion", "model": "ollama/qwen2.5-coder:7b" },
 { "when": "summarize", "model": "claude-haiku-4-5" },
 { "when": "critic", "model": "claude-haiku-4-5" },
 { "when": "worker", "model": "ollama/qwen3-coder:30b" },
 { "when": "planner", "model": "claude-haiku-4-5" },
 { "when": "judge", "model": "ollama/qwen2.5-coder:7b" },
 { "when": "visual", "model": "claude-sonnet-4-6" },
 { "when": "embed", "model": "local/all-MiniLM-L6-v2" }
]
```
Role taxonomy (every dispatch point in SideCar is tagged with one): chat (one-off Q&A without tools), agent-loop (multi-turn tool-using work), completion (FIM autocomplete), summarize (ConversationSummarizer, prompt pruner, tool-result compressor), critic (War Room critic, completion-gate critic), worker (delegate_task local research worker), planner (edit-plan pass, fork approach planner), judge (fork judge, constraint-approval scoring), visual (screenshot VLM for browser-agent verification), embed (Project Knowledge Index vectors — this one is provider-specific and rarely overridden, but exposed for completeness). Compound match expressions — rules can include signal filters after the role: agent-loop.complexity=high (turn count × tool fan-out × file span exceeds threshold), agent-loop.files~=src/physics/** (glob match on files the turn is touching), chat.prompt~=/pro\b|think hard/ (explicit user cue in the prompt), agent-loop.retryCount>=3 (escalate on recurring failure). Signals are computed cheaply before each dispatch and passed to the router along with the role. Hot-swap is literal: within a single conversation, the active model changes at role boundaries — SideCarClient.updateModel() already exists, so the ModelRouter service just calls it with the rule-resolved choice before each dispatch. Message history is preserved across swaps (all backends speak compatible message shapes for the roles we swap into); tool definitions are unchanged; Anthropic prompt-cache breakpoints survive within a same-model run so the 90% cached-read discount doesn’t get reset by a cross-role swap to a different provider. Cost visibility: a status-bar item shows the current active model with a tooltip breaking down this session’s spend by role (agent-loop: $0.42 (sonnet) · chat: $0.00 (local llama) · summarize: $0.03 (haiku)) so users see exactly where their money is going. 
Budget-aware downgrade: each rule can declare a dailyBudget / sessionBudget / hourlyBudget and an optional fallbackModel; when the cap trips, the router automatically downgrades (claude-opus-4-6 → claude-sonnet-4-6 → claude-haiku-4-5 → ollama/qwen3-coder:30b) and surfaces a single non-blocking toast. One-off overrides: the /model <name> slash command pins a model for the rest of the session regardless of rules, and the @opus, @sonnet, @haiku, and @local inline sentinels in the user message bypass routing for just that turn. Migration from existing per-role settings is automatic: on first activation with modelRouting.rules set, SideCar translates any non-default sidecar.completionModel / sidecar.critic.model / etc. into synthesized rules and writes them into the new config, keeping the old fields as no-ops for backward compatibility. Users without modelRouting.rules keep the current per-field behavior — zero migration cost for the simple case. Composes with every earlier entry: Skills 2.0’s preferred-model frontmatter becomes a per-skill rule injected for the skill’s lifetime; Facets’ preferredModel becomes a per-facet rule; Fork & Parallel Solve can declare per-fork model rules (fourier on Sonnet, wavelet on Haiku for cost comparison); the GPU-Aware Load Balancing feature’s auto-downgrade on VRAM pressure becomes one of the router’s triggers rather than a parallel code path; Audit Mode can require confirmation when the router would escalate to a paid model without user awareness. Ad-hoc complexity heuristic for agent-loop.complexity=high (tunable, good defaults): turn count >= 5 OR distinct-files-touched >= 3 OR consecutive-tool-use-blocks >= 8 OR user prompt contains explicit reasoning cues (prove, verify, reason through, think step by step). The heuristic is boring on purpose — anything smarter invites surprises about why a cheap session suddenly escalated.
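The complexity heuristic above is simple enough to express as a single predicate. The sketch below uses the thresholds and cue phrases listed in the spec; the `TurnSignals` shape and the function name are hypothetical, and whole-word matching is an assumption added here so that a prompt like "improve the docs" does not trip the `prove` cue.

```typescript
// Hypothetical signal bundle; field names are illustrative.
interface TurnSignals {
  turnCount: number;
  distinctFilesTouched: number;
  consecutiveToolUseBlocks: number;
  prompt: string;
}

// Cue phrases taken from the spec text.
const REASONING_CUES = ["prove", "verify", "reason through", "think step by step"];

function isHighComplexity(s: TurnSignals): boolean {
  // Whole-word, case-insensitive match (an assumption of this sketch).
  const hasCue = REASONING_CUES.some((cue) =>
    new RegExp(`\\b${cue}\\b`, "i").test(s.prompt),
  );
  return (
    s.turnCount >= 5 ||
    s.distinctFilesTouched >= 3 ||
    s.consecutiveToolUseBlocks >= 8 ||
    hasCue
  );
}
```

Because every clause is an independent OR, a user can always explain an escalation by pointing at exactly one signal that crossed its threshold.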
Configured via sidecar.modelRouting.enabled (default false — opt-in until users have calibrated rules), sidecar.modelRouting.rules (ordered rule list, first match wins), sidecar.modelRouting.defaultModel (fallback when no rule matches, defaults to sidecar.model), sidecar.modelRouting.visibleSwaps (default true — show a brief toast on model swap so the user knows what happened; false for silent operation once calibrated), and sidecar.modelRouting.dryRun (default false; when true, the router logs what it would have selected but sticks with sidecar.model, for safely calibrating rules before enabling them).
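A minimal sketch of first-match-wins resolution with `dryRun` handling might look like the following. The names `Rule`, `Signals`, `RouterOptions`, `matches`, and `resolveModel` are hypothetical, and the sketch deliberately supports only the `=` and `>=` operators; the `~=` glob and regex filters from the spec are left out and simply never match here.

```typescript
interface Rule { when: string; model: string }
type Signals = Record<string, string | number>;

interface RouterOptions {
  defaultModel: string; // sidecar.modelRouting.defaultModel
  currentModel: string; // what sidecar.model resolves to today
  dryRun: boolean;      // sidecar.modelRouting.dryRun
}

// Parses "role" or "role.signal<op>value", where <op> is "=" or ">=".
function matches(when: string, role: string, signals: Signals): boolean {
  const dot = when.indexOf(".");
  const ruleRole = dot === -1 ? when : when.slice(0, dot);
  if (ruleRole !== role) return false;
  if (dot === -1) return true; // bare role rule

  const m = /^(\w+)(>=|=)(.+)$/.exec(when.slice(dot + 1));
  if (!m) return false; // unsupported operator in this sketch (e.g. "~=")
  const [, name, op, raw] = m;
  const actual = signals[name];
  if (actual === undefined) return false;
  return op === "=" ? String(actual) === raw : Number(actual) >= Number(raw);
}

function resolveModel(
  rules: Rule[],
  role: string,
  signals: Signals,
  opts: RouterOptions,
): string {
  // Rules are ordered; the first one that matches wins.
  const hit = rules.find((r) => matches(r.when, role, signals));
  const chosen = hit ? hit.model : opts.defaultModel;
  if (opts.dryRun) {
    // Log the would-be selection but keep the current model.
    console.log(`[modelRouting dry-run] ${role} would use ${chosen}`);
    return opts.currentModel;
  }
  return chosen;
}
```

Listing the most specific rule first matters in this design: if the bare `agent-loop` rule preceded `agent-loop.complexity=high`, the compound rule would never be reached.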
```
flowchart TD
 D[Dispatch point] --> ROLE[Tag role: chat / agent-loop / completion / summarize / ...]
 ROLE --> SIG[Compute signals: complexity, files, retries, prompt cues]
 SIG --> RULES{Match rules top-down}
 RULES -->|first match| BUDG{Budget ok?}
 BUDG -->|yes| SWAP[updateModel to rule's choice]
 BUDG -->|exhausted| FALL[Fallback model or chain to next rule]
 FALL --> BUDG
 SWAP --> DISP[Dispatch to backend]
 DISP --> TRACK[Track spend per role]
 TRACK --> STATUS[Status bar: active model + tooltip spend breakdown]
 RULES -->|no match| DEF[defaultModel]
 DEF --> DISP
```
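The budget loop in the flowchart (exhausted budget, fall back, check again) can be illustrated with a small chain walk. This is a sketch under assumptions: `BudgetedModel` and `pickWithinBudget` are hypothetical names, only a session budget is modeled, and an entry without a budget (for example a local model) is treated as uncapped.

```typescript
interface BudgetedModel {
  model: string;
  sessionBudgetUsd?: number; // undefined means uncapped (assumption of this sketch)
}

interface Pick { model: string; downgraded: boolean }

// Walks the downgrade chain and returns the first model still within budget,
// or null when every entry is capped and exhausted.
function pickWithinBudget(
  chain: BudgetedModel[],
  spentUsd: Record<string, number>,
): Pick | null {
  const idx = chain.findIndex(
    (e) =>
      e.sessionBudgetUsd === undefined ||
      (spentUsd[e.model] ?? 0) < e.sessionBudgetUsd,
  );
  if (idx === -1) return null;
  // Any choice other than the head of the chain counts as a downgrade.
  return { model: chain[idx].model, downgraded: idx > 0 };
}
```

A caller could use the `downgraded` flag to decide whether to show the single non-blocking toast mentioned in the spec.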
GPU-Aware Load Balancing — SideCar monitors VRAM pressure in real time (via nvidia-smi, rocm-smi, or the Metal Performance HUD on Apple Silicon) and automatically backs off when a competing workload — such as a PyTorch/JAX training run — is detected consuming significant VRAM. Three escalating responses: (1) silent downgrade — swap to a smaller quantised variant of the current model (e.g. q8_0 → q4_K_M) if one is available locally; (2) user prompt — if no smaller local model is available, surface a non-blocking toast offering to switch to a cloud provider (Anthropic / OpenAI) for the duration of the heavy workload; (3) pause & queue — if the user dismisses the toast, queue pending agent turns and retry once VRAM headroom recovers. Restores the original model automatically when pressure drops below the threshold. Configurable via sidecar.gpuLoadBalancing.enabled, sidecar.gpuLoadBalancing.vramThresholdPercent (default: 80), sidecar.gpuLoadBalancing.fallbackModel, and sidecar.gpuLoadBalancing.cloudFallbackProvider.
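The three escalating responses can be read as a small decision function. The sketch below is illustrative only: `GpuState`, `GpuAction`, and `chooseGpuAction` are hypothetical names, the VRAM reading is assumed to be supplied by whatever probe the platform offers, and treating "at or above the threshold" as pressure is an assumption the spec does not pin down.

```typescript
type GpuAction =
  | { kind: "none" }                         // no pressure: keep or restore the original model
  | { kind: "downgrade"; model: string }     // (1) silent downgrade to a smaller quant
  | { kind: "offer-cloud"; provider: string } // (2) non-blocking toast
  | { kind: "queue" };                       // (3) pause and retry when headroom recovers

interface GpuState {
  vramUsedPercent: number;
  thresholdPercent: number;       // sidecar.gpuLoadBalancing.vramThresholdPercent (default 80)
  smallerLocalVariant?: string;   // a smaller quantised build, if one exists locally
  cloudFallbackProvider?: string; // sidecar.gpuLoadBalancing.cloudFallbackProvider
  cloudOfferDismissed: boolean;   // the user dismissed the toast
}

function chooseGpuAction(s: GpuState): GpuAction {
  if (s.vramUsedPercent < s.thresholdPercent) return { kind: "none" };
  if (s.smallerLocalVariant !== undefined) {
    return { kind: "downgrade", model: s.smallerLocalVariant };
  }
  if (!s.cloudOfferDismissed && s.cloudFallbackProvider !== undefined) {
    return { kind: "offer-cloud", provider: s.cloudFallbackProvider };
  }
  // No smaller model, and the cloud offer was dismissed or is not configured.
  return { kind: "queue" };
}
```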
Real-time code profiling — MCP server wrapping language profilers

Security & Permissions

Granular permission controls — per-category tool permissions, upfront scope requests
Enhanced sandboxing — constrained environments for dangerous tools
Customizable code analysis rules — sidecar.analysisRules with regex patterns and severity

Providers & Integration

Remote PR Review Automation — Fetch, Analyze, Post Line-Anchored Comments — extends the shipped local reviewCurrentChanges into a proper remote PR review loop. Today sidecar.reviewChanges runs on whatever’s in the local working tree; if the user wants to review someone else’s PR they have to git fetch && git checkout manually first. This entry adds SideCar: Review Pull Request <#> which takes a PR number (or owner/repo + number, or a full GitHub URL), fetches the PR’s unified diff via /repos/:owner/:repo/pulls/:number + /repos/:owner/:repo/pulls/:number/commits + /repos/:owner/:repo/pulls/:number/comments, runs the reviewer against the fetched diff plus the existing comment thread context (“the reviewer already flagged the auth regression in comment #47 — don’t re-flag it”), and posts line-anchored review comments back via POST /repos/:owner/:repo/pulls/:number/comments with the path + line + side + commit_id the GitHub API requires. Structured reviewer output: the reviewer prompt is extended to emit JSON-tagged findings — { path, line, side: 'RIGHT' | 'LEFT', severity: 'block' | 'suggest' | 'nit', message, suggestedChange? } — so the poster can route block findings to a requested-changes review, suggest to regular comments, and nit to resolved discussions by default. Dry-run by default: first run produces a preview webview listing every proposed comment; the user picks which to post. sidecar.pr.review.autoPost: true opts into posting directly (for CI bots / automation accounts). Composes with Skills: the review-code skill that already ships becomes the default prompt for remote PR review; project-local review skills in <workspace>/.sidecar/skills/ override for domain-specific review rules (security-focused PRs, performance-sensitive modules). 
Composes with Facets: a batch of facets can each review the same PR — security-reviewer looks for auth/injection issues, test-author flags missing test coverage, general-coder catches logic bugs — and the aggregated-review UI from v0.66 merges their findings with per-facet tags so the user sees “security-reviewer flagged lines 42-48 for CSRF, test-author flagged lines 12-20 for missing test, general-coder had no issues.” Configured via sidecar.pr.review.defaultSkill (default review-code), sidecar.pr.review.severityMapping (maps the three severity tiers to review event types — default block → REQUEST_CHANGES, suggest → COMMENT, nit → COMMENT), sidecar.pr.review.autoPost (default false), and sidecar.pr.review.includeExistingComments (default true — set false to do a clean review that ignores prior reviewer signal).
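As a rough illustration of the structured findings and the default severity mapping, the sketch below types a finding exactly as the spec lists it and derives the review event for a batch. The helper names `reviewEventFor` and `toCommentPayload` are hypothetical; the payload fields (`path`, `line`, `side`, `commit_id`, `body`) are the ones the spec says the GitHub endpoint requires, plus the comment body.

```typescript
type Severity = "block" | "suggest" | "nit";
type ReviewEvent = "REQUEST_CHANGES" | "COMMENT";

interface Finding {
  path: string;
  line: number;
  side: "RIGHT" | "LEFT";
  severity: Severity;
  message: string;
  suggestedChange?: string;
}

// Default for sidecar.pr.review.severityMapping, as stated in the spec.
const DEFAULT_MAPPING: Record<Severity, ReviewEvent> = {
  block: "REQUEST_CHANGES",
  suggest: "COMMENT",
  nit: "COMMENT",
};

// One REQUEST_CHANGES-mapped finding is enough to make the whole review blocking.
function reviewEventFor(
  findings: Finding[],
  mapping: Record<Severity, ReviewEvent> = DEFAULT_MAPPING,
): ReviewEvent {
  return findings.some((f) => mapping[f.severity] === "REQUEST_CHANGES")
    ? "REQUEST_CHANGES"
    : "COMMENT";
}

// Builds the body for one line-anchored comment.
function toCommentPayload(f: Finding, commitId: string) {
  return {
    path: f.path,
    line: f.line,
    side: f.side,
    commit_id: commitId,
    body: `[${f.severity}] ${f.message}`,
  };
}
```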
CI Failure Analysis & Fix — GitHub Actions Log Ingestion with Proposed Repair Commits — closes the gap between “CI failed on my PR” and “I know why and how to fix it.” Today SideCar’s Terminal Error Interception (shipped) catches failures in the integrated terminal; this entry extends the same flow to remote CI. SideCar: Analyze Failed CI Run fetches the latest failed run for the current branch via /repos/:owner/:repo/actions/runs?branch=...&status=failure&per_page=1, downloads the failed job’s log via /repos/:owner/:repo/actions/jobs/:job_id/logs (with 4 MB cap; on overflow, uses tail semantics via a Range header), extracts the failing step’s log slice using the ##[endgroup] / ##[error] markers GitHub Actions emits, and feeds it through the same diagnose-in-chat synthesized-prompt path that terminal errors already use. PR-aware mode: when the current branch has an open PR, the flow auto-detects it and offers “Propose a fix commit” — the agent diagnoses the failure, opens a new <branch>-fix-ci branch in a Shadow Workspace, makes the fix, runs local tests, and opens a draft follow-up PR or pushes directly onto the original branch (gated by user approval). Log parsing: per-runner-type (Linux / macOS / Windows) regexes strip ANSI, collapse timestamp prefixes, detect test-runner output patterns (vitest / jest / pytest / go test / cargo test / rspec — use existing TestRunnerRegistry), and surface the test that failed + the assertion message rather than the raw 4 MB log. Composes with Actions filter: a new sidecar.ci.analysis.jobFilter (glob array against job name) lets users scope to the jobs that matter — if CI has a lint job and a test job, analyzing the test failure first is usually right. Configured via sidecar.ci.analysis.enabled (default true), sidecar.ci.analysis.maxLogBytes (default 4_000_000), sidecar.ci.analysis.jobFilter (default ["*"]), and sidecar.ci.analysis.autoProposeFix (default false — requires user confirmation before opening a fix branch).
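The log-slicing step could be sketched as below. This is a simplified illustration, not the per-runner parser the spec describes: it assumes each raw line starts with an ISO-8601 timestamp followed by a space, treats the first `##[error]` line as the failure, and returns everything after the preceding `##[endgroup]` marker. The function name `extractErrorSlice` is hypothetical.

```typescript
// Matches common ANSI colour/style escape sequences.
const ANSI = /\u001b\[[0-9;]*[A-Za-z]/g;
// Matches a leading timestamp such as "2024-05-01T12:00:00.0000000Z ".
const TIMESTAMP = /^\d{4}-\d{2}-\d{2}T[\d:.]+Z\s/;

function extractErrorSlice(rawLog: string): string[] {
  const lines = rawLog
    .split(/\r?\n/)
    .map((l) => l.replace(ANSI, "").replace(TIMESTAMP, ""));

  const errorAt = lines.findIndex((l) => l.startsWith("##[error]"));
  if (errorAt === -1) return []; // no error marker found

  // Walk back to the closest ##[endgroup] so setup noise is dropped.
  let start = 0;
  for (let i = errorAt - 1; i >= 0; i--) {
    if (lines[i].startsWith("##[endgroup]")) {
      start = i + 1;
      break;
    }
  }
  return lines.slice(start, errorAt + 1);
}
```

The cleaned slice is what would then be handed to the test-runner pattern detection, so the model sees the failing assertion rather than the full multi-megabyte log.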

Draft PR From Branch — One-Command Push + Generate + Open — a single command that replaces the three-step manual dance most users do today (git push -u origin HEAD + craft title/body + gh pr create). SideCar: Create Pull Request runs git push -u origin HEAD (with a pre-flight branch-protection check — see below), invokes the existing local summarizePR path against the commit range since the base branch’s divergence point (git merge-base) to produce a title + body, opens a preview for the user to edit, then calls POST /repos/:owner/:repo/pulls. Draft by default: PRs are opened as drafts (draft: true) so they don’t spam reviewer queues before the author has had a last look; a one-click Ready for review follows the existing github tool pattern. Template awareness: when .github/pull_request_template.md or .github/PULL_REQUEST_TEMPLATE.md exists, it’s loaded and its sections are filled in section-by-section by the model (not overwritten wholesale — preserves H2 headings the template declares). Configured via sidecar.pr.create.draftByDefault (default true), sidecar.pr.create.baseBranch (default auto-detected from HEAD’s upstream-tracking branch or origin/HEAD), and sidecar.pr.create.template (auto | ignore | <absolute path>, default auto).
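Section-by-section template filling can be sketched as a pass that keeps every H2 heading and swaps only the bodies that have generated content. This is a minimal sketch under assumptions: `fillTemplate` is a hypothetical name, sections are keyed by their exact H2 heading text, and sections without generated content keep the template's placeholder body.

```typescript
function fillTemplate(
  template: string,
  generated: Record<string, string>,
): string {
  const out: string[] = [];
  let skipping = false; // true while dropping a placeholder body we replaced

  for (const line of template.split("\n")) {
    const h2 = /^##\s+(.+?)\s*$/.exec(line);
    if (h2) {
      out.push(line); // headings are always preserved verbatim
      const body = generated[h2[1]];
      skipping = body !== undefined;
      if (body !== undefined) out.push("", body, "");
      continue;
    }
    if (!skipping) out.push(line);
  }
  return out.join("\n").trimEnd() + "\n";
}
```

For example, filling only `Summary` in a template with `Summary` and `Testing` sections leaves the `Testing` heading and its placeholder untouched, so the author still sees what remains to be written.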

Branch Protection Awareness — Pre-Push Status-Check + Required-Reviewer Warnings — prevents the common “pushed straight to main, failed CI, got chased by the team” footgun. Before any git push / git_push tool call against a branch, SideCar queries /repos/:owner/:repo/branches/:branch/protection (authenticated) and /repos/:owner/:repo/commits/:sha/check-runs to find required status checks + required reviewer counts. If the branch is protected and the push target either fails the required checks or lacks the required approvals, a modal surfaces the gaps (“main requires status checks ci/lint and ci/test; only ci/lint has passed on this commit. Required reviewer count is 2; you have 0 approving reviews.”) with Proceed / Cancel. The warning is skipped for unprotected branches and for the user’s own feature branches. Composes with Draft PR: the Create Pull Request flow runs this check against the base branch at submit time and warns that the PR can’t merge until checks/reviewers are satisfied — sets expectations before the author waits on CI. Configured via sidecar.pr.branchProtection.enabled (default true) and sidecar.pr.branchProtection.warnEvenIfPassing (default false — turns on a soft reminder even when checks pass so the user sees what’s required).
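The gap computation behind the modal reduces to a comparison between what the branch requires and what the commit has. In the sketch below, `ProtectionRules`, `CommitStatus`, and `protectionGaps` are hypothetical names, and the inputs are assumed to have been distilled already from the two API responses mentioned above.

```typescript
interface ProtectionRules {
  requiredChecks: string[];
  requiredApprovals: number;
}

interface CommitStatus {
  passedChecks: string[];
  approvals: number;
}

// Returns human-readable gaps; an empty list means nothing blocks the push.
function protectionGaps(
  rules: ProtectionRules | null, // null models an unprotected branch
  status: CommitStatus,
): string[] {
  if (rules === null) return [];
  const gaps: string[] = [];

  const passed = new Set(status.passedChecks);
  const missing = rules.requiredChecks.filter((c) => !passed.has(c));
  if (missing.length > 0) {
    gaps.push(`missing required checks: ${missing.join(", ")}`);
  }
  if (status.approvals < rules.requiredApprovals) {
    gaps.push(`approvals: ${status.approvals} of ${rules.requiredApprovals}`);
  }
  return gaps;
}
```

With the scenario quoted in the spec (`ci/lint` and `ci/test` required, only `ci/lint` passed, 0 of 2 approvals), this sketch reports two gaps, one for the missing check and one for the approvals.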
Process Lifecycle Hardening — ManagedChildProcess + Registry + Orphan Sweep — closes the real-world failure mode where VS Code window reload or abrupt IDE close strands child processes spawned by SideCar (MCP stdio servers, the ShellSession persistent shell, custom-tool wrappers, future background workers). Current state is better than many extensions — MCPManager, ShellSession, EventHookManager, ToolRuntime, and Scheduler all implement dispose() and are pushed into context.subscriptions so the VS Code lifecycle drives teardown — but three gaps bite under real conditions. (1) MCPManager.disconnect() awaits the SDK’s client.close() with no timeout (mcpManager.ts:420-424); a stdio server whose stdin handler blocks means close() hangs forever, VS Code’s own deactivate timeout force-kills the extension host, and the child process gets reparented to init (Linux) or abandoned (macOS). (2) Activation assumes a clean slate — there is no detection of “I rebooted because VS Code crashed, there’s a stale mcp-server python process still bound to port 9000.” (3) HTTP/SSE MCPs that bind local ports leave the port in TIME_WAIT or held by the orphan; new sessions fail to connect with a confusing error. This entry introduces a unified lifecycle primitive across every spawn site. ManagedChildProcess wrapper at src/agent/processLifecycle.ts standardizes every spawn: enforces detached: false so SIGTERM propagates on parent death, pipes stdio (never inherits) so descriptors close cleanly, registers PID + spawn signature into a ProcessRegistry on start, emits typed lifecycle events (spawned / closed / killed / timeout) observable from tests, and provides one canonical close chain: graceful close (await provided cleanup fn) → 2s timeout → SIGTERM → 1s → SIGKILL. The chain is deterministic — worst-case 3s per child, parallelizable across N children, so dispose() on the extension has a bounded cost VS Code can honor. 
ProcessRegistry singleton pushed into context.subscriptions at the top of activation; every spawn site (MCP StdioClientTransport, ShellSession, AgentTerminalExecutor where applicable, custom-tool wrappers, future HTTP-bound servers) routes through the registry rather than calling child_process.spawn directly. Registry-level dispose triggers the close chain for every live PID in parallel, respecting the 3s budget. Per-session PID manifest at .sidecar/pids.json (gitignored, one line per PID: { pid, command, args, cwd, spawnedAt, expectedPort?, sessionId }). Append on spawn, remove on clean exit, rotate on activation after the sweep completes. Startup orphan sweep reads the manifest from the prior session (if any) and for each listed PID: (a) probe liveness via process.kill(pid, 0) (throws ESRCH when gone); (b) if alive, verify the process cmdline matches the stored spawn signature by reading /proc/<pid>/cmdline on Linux / ps -o command= -p <pid> on macOS — protects against killing an unrelated PID that got recycled to the same number; (c) if ours and still alive, run the SIGTERM → SIGKILL chain. Sweep runs in parallel; results surface in the activation log ([SideCar] Cleaned 2 orphan MCP processes from prior session) and as a SideCar: Show Orphan Sweep Report command for users who want the detail. Port-lock sweep for HTTP/SSE MCPs: when a configured URL points at localhost:<port> and a pre-bind probe finds the port already in use, look up the owner via platform-specific tooling (lsof -i :<port> -t on macOS/Linux, netstat -ano | findstr :<port> on Windows), check whether the owner PID is in our prior manifest — if yes, kill it and retry bind; if no, surface a clear error asking the user to free the port before continuing. 
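The per-PID decision in the orphan sweep can be sketched as a pure function with the platform probes injected. `ManifestEntry` here carries only a subset of the manifest fields, and `parseManifest`, `classify`, and the two callback parameters are hypothetical names. In a Node extension host, `isAlive` would typically wrap `process.kill(pid, 0)` in a try/catch, and `readCmdline` would read `/proc/<pid>/cmdline` or shell out to `ps`, normalising separators to spaces.

```typescript
interface ManifestEntry { pid: number; command: string; args: string[] }

type SweepDecision =
  | "gone"      // process already exited; nothing to do
  | "recycled"  // PID is alive but belongs to something else; do not touch
  | "kill";     // still ours; run the SIGTERM then SIGKILL chain

// The manifest holds one JSON object per line.
function parseManifest(text: string): ManifestEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ManifestEntry);
}

function classify(
  entry: ManifestEntry,
  isAlive: (pid: number) => boolean,
  readCmdline: (pid: number) => string | null,
): SweepDecision {
  if (!isAlive(entry.pid)) return "gone";

  // Compare against the stored spawn signature to guard against PID reuse.
  const expected = [entry.command, ...entry.args].join(" ");
  const actual = readCmdline(entry.pid);
  if (actual === null || actual.trim() !== expected) return "recycled";
  return "kill";
}
```

Keeping the probes injectable also makes the sweep testable without spawning real processes, which fits the spec's goal of lifecycle events observable from tests.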
MCPManager integration: disconnect() still calls the SDK’s client.close() first (gives the protocol a chance to exit gracefully) but in a Promise.race against a per-server timeout (default 2000ms, configurable via sidecar.mcpServers.<name>.closeTimeoutMs). On timeout, the underlying ManagedChildProcess takes over with SIGTERM. Window reloads that previously orphaned a stdio server now complete within 3s with zero survivors. Composes with Shadow Workspaces: the git worktree dispose() path also runs through the registry so abandoned worktrees from crashed sessions get swept alongside process orphans. Composes with Audit Mode: a new process_lifecycle audit event fires on every sweep (orphan killed, timeout triggered, kill chain completed) so admins auditing team environments can see the signal. Configured via sidecar.processLifecycle.enabled (default true; setting false falls back to today’s best-effort dispose with a warning), sidecar.processLifecycle.closeTimeoutMs (default 2000, clamped 500–10000), sidecar.processLifecycle.killTimeoutMs (default 1000, clamped 200–5000), sidecar.processLifecycle.orphanSweep.enabled (default true), sidecar.processLifecycle.orphanSweep.reportOnActivation (default true — surfaces a toast when ≥1 orphan was cleaned; set false for headless / CI environments), and sidecar.processLifecycle.portSweep.enabled (default true). Explicitly out of scope: process isolation sandboxing (cgroups, namespaces — OS-specific, belongs to later security work), resource quotas (CPU/RAM caps per child — vision-shelf item), cross-machine PID tracking for Dev Containers / SSH extension hosts (VS Code’s own lifecycle handles these — the extension host PID is the meaningful one, and VS Code kills it on disconnect).
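The close chain (graceful close, then SIGTERM, then SIGKILL, each bounded by a timeout) might be sketched like this. The `Closable` interface, `settlesWithin`, and `closeChain` are hypothetical stand-ins for whatever `ManagedChildProcess` exposes; the default timeouts mirror the 2000 ms and 1000 ms defaults named in the spec.

```typescript
// Minimal view of a child process for this sketch.
interface Closable {
  kill(signal: "SIGTERM" | "SIGKILL"): void;
  exited: Promise<void>; // resolves once the child has exited
}

type CloseOutcome = "graceful" | "sigterm" | "sigkill";

// Resolves true if p fulfils within ms; false on timeout or rejection.
function settlesWithin(p: Promise<unknown>, ms: number): Promise<boolean> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve(false), ms);
    p.then(
      () => { clearTimeout(timer); resolve(true); },
      () => { clearTimeout(timer); resolve(false); },
    );
  });
}

async function closeChain(
  child: Closable,
  gracefulClose: () => Promise<void>, // e.g. the SDK's client.close()
  closeTimeoutMs = 2000,
  killTimeoutMs = 1000,
): Promise<CloseOutcome> {
  // Step 1: ask nicely and wait for the child to actually exit.
  const graceful = gracefulClose().then(() => child.exited);
  if (await settlesWithin(graceful, closeTimeoutMs)) return "graceful";

  // Step 2: SIGTERM, bounded by the kill timeout.
  child.kill("SIGTERM");
  if (await settlesWithin(child.exited, killTimeoutMs)) return "sigterm";

  // Step 3: last resort.
  child.kill("SIGKILL");
  return "sigkill";
}
```

With the defaults, the worst case per child is 2 s plus 1 s, which matches the 3 s budget the spec relies on, and closing N children with `Promise.all` keeps the total bounded by the same figure.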
Hook Execution Hardening — Streaming spawn + activity-adaptive timeouts + unified env sanitization — closes three real failure modes in the two hook systems SideCar ships today. Current state: both sidecar.hooks (per-tool pre/post at executor.ts:816-862) and sidecar.eventHooks (onSave/onCreate/onDelete at eventHooks.ts:83-108) wrap execAsync — Node’s exec with a promise adapter. Both enforce a fixed 15s timeout. eventHooks.ts has a local sanitizeEnvValue() that strips control characters (null bytes, newlines, ESC sequences) from SIDECAR_FILE; sidecar.hooks applies redactSecrets() to SIDECAR_INPUT/SIDECAR_OUTPUT but does not strip control characters, so the two hook systems have inconsistent defenses against the same injection class. Three gaps this entry closes: (1) exec buffer overflow — Node’s exec defaults to a 1 MB stdout cap; any hook producing more (a verbose test suite, a lint run with hundreds of findings, a Python script with a big traceback) crashes the hook with stdout maxBuffer length exceeded and the agent loop sees a generic failure rather than the actual hook output. (2) fixed timeout with no adaptivity — a slow-but-working npm test post-hook legitimately takes 45 seconds on a mid-size project; at 15s it gets killed even though stdout is streaming test progress the whole time. The agent loop interprets the timeout as a hook failure and either blocks (pre-hook) or warns (post-hook) when the hook was actually doing exactly what it should. (3) inconsistent env sanitization — SIDECAR_INPUT in sidecar.hooks can contain raw filename or tool-argument content with embedded ESC sequences or newlines that, under a bash -c "echo $SIDECAR_INPUT" pattern, bleed into the shell’s handling of the variable. redactSecrets() catches credential-shaped content but doesn’t normalize control chars. Unified hookRunner.ts replaces execAsync in both sites. 
Uses child_process.spawn with piped stdio; reads stdout + stderr in chunks via data listeners; accumulates into a bounded ring buffer with explicit truncation semantics (default 10 MB cap via sidecar.hooks.maxOutputBytes, configurable; on overflow, drops the middle and keeps head + tail with a [... N bytes elided] marker, same pattern the existing prompt pruner uses for tool_result blocks); surfaces the truncated-but-complete output to the caller on exit. Activity-adaptive timeout: initial budget from sidecar.hooks.timeoutMs (default 15000), a monotonic clock starts at spawn, each data event from stdout or stderr resets a per-activity timer to sidecar.hooks.extendOnActivityMs (default 5000). Hook is killed when either (a) initial budget elapses with zero output activity, or (b) total elapsed exceeds sidecar.hooks.maxTimeoutMs (default 300000, 5 min hard cap). A fast hook completes well under 15s; a slow-but-working hook that produces output every few seconds runs to completion up to the 5 min hard cap; a truly hung hook that goes silent gets killed at the initial 15s boundary. Configurable, but defaults are tuned for the common cases: lint/format run quickly, test suites take minutes with streaming output. Unified sanitization: extract sanitizeEnvValue() from eventHooks.ts into src/agent/envSanitize.ts (new module, exports a single pure function) and apply it to every hook env var in both hook systems — SIDECAR_TOOL, SIDECAR_INPUT, SIDECAR_OUTPUT, SIDECAR_FILE, SIDECAR_EVENT. redactSecrets() still runs on top of sanitization for credential content. Same defense surface applied uniformly; fixes the inconsistency where eventHooks was hardened but tool-hooks weren’t. Hook children route through ManagedChildProcess (the Process Lifecycle Hardening primitive in the paired spec above) — so a hook that slips past every timeout and VS Code force-kills the extension host still gets cleaned up on next activation via the orphan sweep. 
Same registry, same PID manifest, same disposal guarantees. Composes with Audit Mode: the event_hook:<event> audit entry already exists; this entry extends it with the new truncated, killedBy: 'idle-timeout' | 'hard-cap' | 'caller-abort', and bytesReceived fields so /audit queries surface “hook was killed for going silent too long” vs. “hook produced 10 MB of output and was truncated” distinctly. Composes with Regression Guards: guard commands also use the hook-runner substrate so guards with streaming output (a long-running fuzz test, a numerical-invariant sweep) benefit from activity-adaptive timeouts too, without separate plumbing. Configured via sidecar.hooks.maxOutputBytes (default 10_000_000, clamped 1_000_000–104_857_600), sidecar.hooks.timeoutMs (default 15000, clamped 1000–60000 — the initial silent-budget), sidecar.hooks.extendOnActivityMs (default 5000, clamped 1000–60000 — the per-chunk extension), and sidecar.hooks.maxTimeoutMs (default 300000, clamped 15000–1800000 — the absolute hard cap). Ships in v0.70 as part of the runtime-correctness pass, paired with Process Lifecycle Hardening.
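The head-plus-tail truncation used for hook output can be sketched as a small accumulator. This is an illustration under assumptions: `ElidingBuffer` is a hypothetical name, the cap is split evenly between head and tail, and the sketch counts string characters rather than bytes for simplicity (a real implementation working on `Buffer` chunks would count bytes).

```typescript
class ElidingBuffer {
  private head = "";
  private tail = "";
  private total = 0;
  private readonly headCap: number;
  private readonly tailCap: number;

  constructor(cap: number) {
    this.headCap = Math.floor(cap / 2);
    this.tailCap = cap - this.headCap;
  }

  // Called for every stdout/stderr chunk.
  push(chunk: string): void {
    this.total += chunk.length;
    const room = this.headCap - this.head.length;
    if (room > 0) {
      this.head += chunk.slice(0, room);
      chunk = chunk.slice(room);
    }
    // Keep only the most recent tailCap characters.
    const joined = this.tail + chunk;
    this.tail =
      joined.length > this.tailCap
        ? joined.slice(joined.length - this.tailCap)
        : joined;
  }

  get truncated(): boolean {
    return this.total > this.head.length + this.tail.length;
  }

  toString(): string {
    const elided = this.total - this.head.length - this.tail.length;
    return elided > 0
      ? `${this.head}\n[... ${elided} bytes elided]\n${this.tail}`
      : this.head + this.tail;
  }
}
```

Keeping the head preserves the command banner and first error, while the tail preserves the final summary, which are usually the two parts of a long test or lint log that matter most.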
Bitbucket / Atlassian — Bitbucket REST API, GitProvider interface, auto-detect from remote URL
OpenRouter — dedicated integration with model browsing, cost display, rate limit awareness → shipped 2026-04-15 in v0.53.0. Dedicated OpenRouterBackend subclass with referrer + title headers, rich catalog fetch via listOpenRouterModels(), first-class entry in BUILT_IN_BACKEND_PROFILES, and a runtime MODEL_COSTS overlay populated from OpenRouter’s per-model pricing (no more hand-maintaining prices for hundreds of proxied models). Per-generation real cost tracking via /generation/{id} still deferred.
Browser automation — Playwright MCP for testing web apps
Extension / plugin API (vision-shelf — superseded by @sidecar/sdk above) — the original bullet described the intent; the spec above is the concrete v0.73 implementation.
Agentic Task Delegation via MCP — elevates MCP from a static tool registry into a dynamic sub-agent orchestration layer. Instead of treating every MCP server as a dumb function call, SideCar can spawn specialised servers on-demand (e.g. a math-engine for symbolic computation, a web-searcher for live retrieval, a code-executor sandbox) and route sub-tasks to them as first-class agents with their own reasoning loop. The lead agent decomposes the user’s request, dispatches sub-tasks to the most capable server via a new delegate_to_mcp tool call, collects structured results, and synthesises a final response — mirroring the hierarchical multi-agent pattern but using the MCP protocol as the inter-agent transport. Server lifecycle (spawn, health-check, teardown) is managed automatically, and each delegation is recorded in the audit log with the server name, input, output, and latency. Configurable via sidecar.mcpDelegation.enabled and sidecar.mcpDelegation.allowedServers.
Voice input (shipped v0.98.0) — microphone button in chat UI. Audio recorded in the VS Code extension host (Swift/AVFoundation on macOS, arecord on Linux, PowerShell+WinMM on Windows — no browser window). Transcribed locally via @huggingface/transformers Whisper or via any HTTP Whisper endpoint. Gated by sidecar.voice.enabled.

Enterprise & Collaboration

Centralized policy management — .sidecar-policy.json for org-level enforcement of approval modes, blocked tools, PII redaction, provider restrictions
Multi-User Agent Shadows — a shared agent knowledge base that lets every contributor’s SideCar instance start with the same learned project context. A team member runs SideCar: Export Project Shadow to serialise the agent’s accumulated knowledge — coding standards, design tokens (colors, typography), mathematical definitions, architectural decisions, naming conventions — into a versioned .sidecar/shadow.json file that is committed to the repo. When a new contributor opens the project, SideCar detects the shadow file and automatically imports it into their local memory store, so their instance already knows the project’s conventions without a single prompt. Entries are namespaced by category (standards, design, math, architecture) and can be individually pinned or overridden locally. Shadow exports are human-readable JSON so they can be reviewed and edited in PRs like any other config file. Controlled via sidecar.shadow.autoImport (default: true) and sidecar.shadow.autoExport (default: false — export is always an explicit user action to avoid leaking sensitive context).
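A shadow import that respects local overrides could be sketched as a keyed merge. The spec does not define the `shadow.json` schema, so everything here is an assumption for illustration: the `ShadowEntry` and `LocalEntry` shapes, the `overridden` flag, and the `importShadow` name are hypothetical; only the four category names come from the spec.

```typescript
type Category = "standards" | "design" | "math" | "architecture";

interface ShadowEntry { category: Category; key: string; value: string }
interface LocalEntry extends ShadowEntry { overridden?: boolean }

// Merges committed shadow entries into the local store.
// Entries the user has overridden locally are never replaced.
function importShadow(
  shadow: ShadowEntry[],
  local: LocalEntry[],
): LocalEntry[] {
  const byId = new Map<string, LocalEntry>();
  for (const e of local) byId.set(`${e.category}/${e.key}`, e);

  for (const e of shadow) {
    const id = `${e.category}/${e.key}`;
    const existing = byId.get(id);
    if (existing !== undefined && existing.overridden === true) continue;
    byId.set(id, { ...e });
  }
  return Array.from(byId.values());
}
```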
Team knowledge base — built-in connectors for Confluence, Notion, internal docs
Real-time collaboration Phase 1 — VS Code Live Share integration (shared chat, presence, host/guest roles)
Real-time collaboration Phase 2 — shared agent control (multi-user approval, message attribution)
Real-time collaboration Phase 3 — concurrent editing with CRDT/OT conflict resolution
Real-time collaboration Phase 4 — standalone @sidecar/collab-server WebSocket package

Technical Debt

Config sub-object grouping (30+ fields → sub-objects)
Real tokenizer integration (js-tiktoken for accurate counting)