Feature Specifications
Detailed specifications for every entry in the SideCar release plan. Each entry describes the problem, the mechanism, integration points, and the configuration surface. Organized thematically.
Context & Intelligence
- SIDECAR.md Retrieval-Mode: Semantic Section Scoring in the Fusion Pipeline. This is the retrieval-based successor to path-scoped injection above. Once the
parseSidecarMdprimitive (v0.67) has landed, this entry layers aSidecarMdRetrieveron top that joins the existingfuseRetrieverspipeline (DocRetriever/MemoryRetriever/SemanticRetriever). Why retrieval is better for some workspaces: path-scoped routing assumes users know which sections apply to which paths, and they annotate accordingly. Large projects with organically-grown SIDECAR.md files (50+ sections, inconsistent heading naming, overlap between section scopes) often don’t — and asking the model “which sections of this doc are relevant to the question I’m asking right now?” is exactly the problem retrieval is good at. Mechanism: on workspace init, every section body is embedded with the sameall-MiniLM-L6-v2model used elsewhere in the retrieval stack and stored in a namespaced LanceDB table (or the flat fallback) at.sidecar/cache/sidecarMd/. On each turn, the retriever scores sections against the fused query (user message + active file path + recent tool_result summaries) via cosine similarity, applies RRF against the other retrievers, and surfaces the top-K as[SIDECAR.md · §<heading>]-tagged hits in the fused context block. Incremental updates: same pattern as Project Knowledge Index —fs.watchonSIDECAR.mdtriggers re-parse + per-section hash compare; only changed sections re-embed. Saves survive across sessions. Hybrid with path routing:sectionsandretrievalcompose naturally. When both are enabled, the path-scopedalwayssections are always included verbatim (cost = no tokens consumed by retrieval scoring for always-sections), and the retriever scores only thescoped+lowpool against the current query to pick top-K. That preserves the deterministic “Build” / “Conventions” inclusion while letting retrieval surface the right “Transforms” section without relying on a path glob the author never wrote. 
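The scoring step described above (cosine similarity per section, a `minScore` cut, a clamped `topK`, then RRF against the other retrievers) can be sketched as follows. This is a minimal illustration, not the shipped `SidecarMdRetriever`: the `Section` and `Hit` shapes, the function names, and the RRF constant `k = 60` are assumptions, since the spec does not show the `fuseRetrievers` signature.

```typescript
// Hypothetical shapes; the real retriever interfaces are not shown in this spec.
interface Section { heading: string; embedding: number[]; }
interface Hit { id: string; score: number; }

function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every section, drop anything under minScore, keep the best topK (clamped to 1..20).
function scoreSections(query: number[], sections: Section[], topK = 5, minScore = 0.3): Hit[] {
  const k = Math.min(20, Math.max(1, topK));
  return sections
    .map((s) => ({ id: `[SIDECAR.md · §${s.heading}]`, score: cosine(query, s.embedding) }))
    .filter((h) => h.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Reciprocal rank fusion: each ranked list contributes 1 / (k + rank) per hit.
// k = 60 is a commonly used constant and an assumption here.
function reciprocalRankFusion(lists: Hit[][], k = 60): Hit[] {
  const fused = new Map<string, number>();
  lists.forEach((list) =>
    list.forEach((hit, index) => {
      fused.set(hit.id, (fused.get(hit.id) ?? 0) + 1 / (k + index + 1));
    })
  );
  return Array.from(fused.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((x, y) => y.score - x.score);
}
```

Because `minScore` is applied before the top-K cut, a doc-light project can legitimately return fewer than `topK` sections, which matches the "never forced-include" intent of the setting.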
Faithfulness audit via the existing RAG-eval harness: a new golden-case fixture atsrc/test/retrieval-eval/sidecarMdGolden.tsasserts that for a query like “how do I add a new transform kernel?” the retrieved section is the one tagged## Transformsand NOT the## Databasesection that happens to share the word “index.” Failures become CI regressions the same way other retrieval quality regressions surface. Composes with Dense-Repository Context Mode: when a domain profile is active (e.g.physics,signal-processing), the profile’spreserveRegexpatterns boost sections containing matching text — the## Invariantssection containingepsilon_0 = 8.854e-12always scores on physics profile turns, even if the user’s immediate query doesn’t say “epsilon.” UI: a SIDECAR.md index health line in the existing observability surface showing indexed sections + disk footprint + last-update-time; a/sidecarmd preview <query>slash command that runs a dry retrieval against an arbitrary query so users can debug why a section isn’t surfacing. Configured viasidecar.sidecarMd.mode: 'retrieval'(opt-in),sidecar.sidecarMd.retrieval.topK(default5, clamped 1–20),sidecar.sidecarMd.retrieval.minScore(default0.3— sections below this threshold never surface even if they’re in top-K, prevents forced-include on doc-light projects), andsidecar.sidecarMd.retrieval.alwaysIncludeHeadings(shared with section-mode — these bypass retrieval scoring and inject verbatim). Roadmap slot: v0.70+ as a retrieval-infrastructure beat, after the path-scoped primitive has been in production long enough to measure the gap retrieval needs to close. -
Scheduled Task Concurrency Safety — Shadow Routing + DocumentConcurrencyGate + Deferral Queue — closes the real failure mode where a
sidecar.scheduledTasksentry fires mid-keystroke: a 2 AM lint/refactor task fires at noon while the developer is actively editing a target file, the agent writes throughfs.tstools directly against the main tree, and aWorkspaceEditlands on a file with an unsaved buffer and an active cursor. The result is catastrophic UI stutter, cursor jump, lost edits if the user saves over the agent’s change — exactly the class of bug that destroys developer trust in background automation. Current state: scheduler.ts:46-70 runs tasks withapprovalMode: 'autonomous', no Shadow Workspace routing, no dirty-buffer check anywhere in the path. The agent happily writes tosrc/foo.tswhile the user is typing insrc/foo.ts. This entry fixes it in three layers. (1) Force Shadow Workspace on every scheduled run: scheduler wraps the agent invocation with{ forceShadow: true, deferPrompt: true }— reuses the v0.66 primitive Facets already use. Writes land in.sidecar/shadows/scheduled-<task-id>/during the run; main tree is untouched for the entire 30-minute lint pass. The user’s active editor at<workspace>/src/foo.tscannot possibly be affected by writes to<workspace>/.sidecar/shadows/<id>/src/foo.ts— different paths, different cwd, theTextDocumentVS Code knows about doesn’t care. (2)DocumentConcurrencyGateat apply time: new primitive atsrc/agent/documentGate.tsexposescheckPathSafe(absolutePath): { safe: boolean; reason?: 'dirty' | 'active' | 'ok' }. Called once per touched path before the shadow’s diff is applied to main. Looks up the path inworkspace.textDocuments.find(d => d.uri.fsPath === absolutePath)— if the document is open ANDisDirty, returns{ safe: false, reason: 'dirty' }. Separately checkswindow.activeTextEditor?.document.uri.fsPath === absolutePath— if the user is actively in that file, returns{ safe: false, reason: 'active' }. Clean + inactive files apply immediately via the existingGitCLI.applyPatchpath; deferred files queue. 
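A minimal sketch of the gate described in layer (2), written against small structural stand-ins so it runs outside the extension host. `EditorHost` and `DocLike` are hypothetical types that mirror only the fields the spec mentions (`workspace.textDocuments`, `isDirty`, `window.activeTextEditor?.document`); the real `src/agent/documentGate.ts` would read the VS Code API directly, and the `canApplyTask` helper is an assumed name for the all-or-nothing rule.

```typescript
// Structural stand-ins for the parts of the VS Code API the gate reads (assumed shapes).
interface DocLike { uri: { fsPath: string }; isDirty: boolean; }
interface EditorHost {
  textDocuments: readonly DocLike[];   // mirrors workspace.textDocuments
  activeDocument: DocLike | undefined; // mirrors window.activeTextEditor?.document
}
interface GateOptions { gateOnDirty: boolean; gateOnActive: boolean; }
type GateResult = { safe: boolean; reason?: 'dirty' | 'active' | 'ok' };

function checkPathSafe(
  host: EditorHost,
  absolutePath: string,
  options: GateOptions = { gateOnDirty: true, gateOnActive: true }
): GateResult {
  // An open document with unsaved changes must never be patched underneath the user.
  const open = host.textDocuments.find((d) => d.uri.fsPath === absolutePath);
  if (options.gateOnDirty && open !== undefined && open.isDirty) {
    return { safe: false, reason: 'dirty' };
  }
  // A clean file is still off limits while it holds the active editor.
  if (options.gateOnActive && host.activeDocument?.uri.fsPath === absolutePath) {
    return { safe: false, reason: 'active' };
  }
  return { safe: true, reason: 'ok' };
}

// Task-level deferral: a single unsafe path defers the whole diff.
function canApplyTask(host: EditorHost, touchedPaths: string[]): boolean {
  return touchedPaths.every((p) => checkPathSafe(host, p).safe);
}
```

Passing the host in as a parameter is a design choice made for this sketch: it keeps the gate a pure function of editor state, which also makes the dirty and active cases straightforward to unit test.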
(3) Persistent deferral queue:.sidecar/scheduled/pending.jsonl(gitignored, append-only) stores{ taskId, taskName, targetPath, pendingDiff, queuedAt, reason }per deferred entry. Survives extension reload and VS Code restart. On activation, the queue is replayed: for each entry, re-checkcheckPathSafeand apply if now safe. AonDidSaveTextDocument+onDidChangeActiveTextEditorlistener fires the same re-check whenever an affected path becomes available — saves drain the queue silently without any foreground interrupt. Cross-task mutex per path: if task B completes with a pending diff forsrc/foo.tswhile task A’s diff for the same file is still queued, the queue serializes them inqueuedAtorder — the newsrc/agent/lockPrimitives.tsmodule (extracted from the existing fileLock.ts FIFO pattern) gates the apply so task B applies onto the post-task-A result rather than racing. If task A’s apply fails (patch conflict with a user edit that landed in the interim), task B still gets its own attempt against the post-user-edit file; failures surface individually. Staleness surface: if a queued entry has been pending longer thansidecar.scheduledTasks.staleWarningMinutes(default60, set to0to disable), a single notification surfaces (“2 scheduled task results are waiting onsrc/foo.ts. Review now?”) that opens the existing facet-review UI from v0.66 — it’s already diff-aware, per-file accept/reject, integrates withvscode.diff, and requires no new UI code. Task-level versus path-level deferral: if ANY target in the task’s diff is unsafe, the task’s diff is deferred whole — we don’t partial-apply a multi-file refactor because half the files were clean. Preserves atomicity; all-or-nothing matches how Audit Mode and Facets already reason about batches. 
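The replay logic for the deferral queue might look like the sketch below. It is a simplification under stated assumptions: `queuedAt` is treated as epoch milliseconds (the spec does not give its type), each queue entry covers one path, and `isSafe` / `apply` are injected callbacks standing in for `checkPathSafe` and `GitCLI.applyPatch`. Running synchronously in `queuedAt` order plays the role of the per-path mutex.

```typescript
// Entry shape from the spec; the field types are assumptions.
interface PendingEntry {
  taskId: string;
  taskName: string;
  targetPath: string;
  pendingDiff: string;
  queuedAt: number; // assumed: epoch milliseconds
  reason: 'dirty' | 'active';
}

// pending.jsonl holds one JSON object per line; blank lines are ignored.
function parsePendingJsonl(text: string): PendingEntry[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as PendingEntry);
}

interface DrainResult { applied: string[]; failed: string[]; stillPending: PendingEntry[]; }

// Replays the queue oldest-first. An unsafe path keeps its entries queued (in order);
// a failed apply is reported on its own and does not block later entries for that path.
function drainQueue(
  entries: PendingEntry[],
  isSafe: (path: string) => boolean,
  apply: (entry: PendingEntry) => boolean
): DrainResult {
  const ordered = entries.slice().sort((a, b) => a.queuedAt - b.queuedAt);
  const blockedPaths = new Set<string>();
  const result: DrainResult = { applied: [], failed: [], stillPending: [] };
  for (const entry of ordered) {
    if (blockedPaths.has(entry.targetPath) || !isSafe(entry.targetPath)) {
      blockedPaths.add(entry.targetPath);
      result.stillPending.push(entry);
      continue;
    }
    if (apply(entry)) {
      result.applied.push(entry.taskId);
    } else {
      result.failed.push(entry.taskId);
    }
  }
  return result;
}
```

The same `drainQueue` call would serve both triggers named above: once on activation, and again from the save and active-editor listeners whenever an affected path may have become available.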
Composes with Audit Mode: when Audit Mode is active, scheduled-task applies route through the audit buffer the same way agent writes do — the user sees “3 scheduled changes awaiting review” in the existing Audit tree, accept/reject applies or discards atomically. Composes with Regression Guards: pre-completion guards run against the shadow before the apply gate, so a guard that detects a broken invariant can block the scheduled task’s apply even when the path is clean and inactive — no “silent 2 AM commit that breaks the build.” Composes with Shadow Sweep: the existing v0.62.1 shadow-sweep already cleans abandoned.sidecar/shadows/entries from crashed sessions; scheduled-task shadow IDs use the samescheduled-<task-name>-<timestamp>namespace so sweep handles them uniformly. Configured viasidecar.scheduledTasks.forceShadow(defaulttrue; escape hatchfalserestores pre-v0.69 direct-write behavior with a one-time warning),sidecar.scheduledTasks.gateOnDirty(defaulttrue),sidecar.scheduledTasks.gateOnActive(defaulttrue),sidecar.scheduledTasks.staleWarningMinutes(default60), andsidecar.scheduledTasks.maxQueuedBatches(default50; hard cap against runaway queue if a path stays dirty for days — oldest batches drop silently with an audit-log entry). Explicitly out of scope: per-character edit merging (that’s git’s job viaapplyPatchconflict detection), real-time collaborative editing (different problem class), cross-user scheduled-task coordination in team environments (different problem class), preemptive abort of a scheduled-task run when the user opens a target file mid-run (the shadow isolates writes already, preemption adds complexity for marginal benefit — let the run finish in its shadow, gate the apply). - Multi-repo cross-talk — impact analysis across dependent repositories via cross-repo symbol registry
- Semantic Agentic Search for Monorepos: cross-repository memory backed by a dedicated MCP server that indexes multiple local folders simultaneously into a unified vector store. The agent can answer questions like “does the algorithm in `repo-a` match the implementation in `repo-b`?” by running a semantic diff across both indices, surfacing divergences, stale copies, and interface mismatches in a single response. Each root is indexed independently so adding or removing a repo doesn’t invalidate the others. Configured via `sidecar.monorepoRoots` (array of absolute paths) and exposed as a `search_repos` tool the agent calls automatically when a prompt references multiple packages. A Repo Index status-bar item shows live indexing progress per root.
-
Dense-Repository Context Mode — Domain Profiles + Invariant-Aware Retention — closes the gap that remains after graph-expanded retrieval ships in v0.65: for deeply-interconnected codebases like electromagnetics simulations, signal-processing engines, and extensive transform libraries, the agent needs not just “pull in the callers” but “keep the load-bearing constants, equations, and physical units from being evicted when the turn gets long.” Today’s compression layer (src/agent/loop/compression.ts) prunes tool_results and old turns by character count — zero awareness of whether a truncated line contained
epsilon_0used in twelve other files, the Maxwell-equation block that the next three write_file calls must stay consistent with, or the sample-rate constant that propagates through every DSP function. This entry introduces structural awareness to both retrieval and pruning. Domain profiles live as declarative markdown with frontmatter at.sidecar/profiles/<name>.md(opt-in, path configurable viasidecar.domainProfiles.registryPath, default.sidecar/profiles/); a profile declares retrieval policy (graphWalkDepth,prioritizeglobs for.m/.py/.tex/.cpp/.f90), invariant patterns to preserve under pruning (preserveRegex: ["\\\\b(epsilon|mu|c|h|k_B)_?0?\\\\b", "\\\\\\\\frac\\\\{[^}]+\\\\}", "const\\\\s+\\\\w+\\\\s*=\\\\s*[0-9]"]), kind priorities (physics.mdbooststype,function,const;signal.mdboostsfunctionwith names matchingfft|dct|dwt|filter|transform), and token-budget hints (reservedForInvariants: 500— a floor carved out of the retrieval budget so invariant lines always get a seat even when the rest of context is hot). Built-in profiles ship forphysics,signal-processing,transforms-and-kernels,numerical-methods, andcontrol-systems; users copy and customize under the same directory. Activated per-workspace viasidecar.domainProfiles.active(string array — profiles compose, e.g.["signal-processing", "physics"]for an EM-simulation repo) or per-prompt via@profile:physicssentinel. Symbol-level importance score layered onto the existing Project Knowledge Index: every symbol gets a precomputed importance value from(fanIn × 0.4) + (referenceCount × 0.3) + (matchesPreserveRegex × 0.3), persisted in the Merkle store next to the embedding. High-importance symbols are exempt from low-priority eviction. When compression needs to free N chars from a tool_result or code snippet, it reads importance scores for every line’s containing symbol and elides the lowest-scoring first — a tool_result containingepsilon_0 = 8.854e-12stays; the surrounding debug print statements drop. 
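The importance formula and the "elide the lowest-scoring lines first" rule can be illustrated with a short sketch. The weights come from the formula above; everything else (the `SymbolStats` and `ScoredLine` shapes, the function names, and the choice to treat `matchesPreserveRegex` as 0 or 1) is an assumption made for illustration, and the inputs are used unnormalized exactly as the formula is written.

```typescript
// Hypothetical per-symbol statistics; the real index schema is not shown in the spec.
interface SymbolStats { fanIn: number; referenceCount: number; matchesPreserveRegex: boolean; }

// (fanIn × 0.4) + (referenceCount × 0.3) + (matchesPreserveRegex × 0.3)
function importance(stats: SymbolStats): number {
  return stats.fanIn * 0.4 + stats.referenceCount * 0.3 + (stats.matchesPreserveRegex ? 1 : 0) * 0.3;
}

interface ScoredLine { text: string; importance: number; }

// Frees at least charsToFree characters by dropping the least important lines first,
// while keeping the surviving lines in their original order.
function elideLowest(lines: ScoredLine[], charsToFree: number): string[] {
  const byImportance = lines
    .map((_line, index) => index)
    .sort((a, b) => lines[a].importance - lines[b].importance);
  const dropped = new Set<number>();
  let freed = 0;
  for (const index of byImportance) {
    if (freed >= charsToFree) break;
    dropped.add(index);
    freed += lines[index].text.length;
  }
  return lines.filter((_line, index) => !dropped.has(index)).map((line) => line.text);
}
```

With this ordering, a line such as `epsilon_0 = 8.854e-12` that belongs to a high-importance symbol is only removed after every lower-scoring line has already been dropped.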
Invariant-aware summarization extends ConversationSummarizer: when an old turn references a preserved-regex hit (say, the Maxwell-equation block), the summarizer replaces the surrounding prose but keeps the equation verbatim as a quoted block. The summary reads “In turn 3, we discussed the divergence of E; the form referenced was:∇·E = ρ/ε₀.” Model sees the summary AND the exact invariant — no drift. Small-context adaptation is the scenario this was designed for: on a 4K local model where every token counts, domain profiles become more valuable, not less, because the profile’sreservedForInvariantsfloor converts “random character truncation” into “keep the physics, drop the narration.” The retriever + pruner consult profile config whenevercontextLength < 16Kand tighten the filtering accordingly. Reference graph surfacing surfaces cross-file numeric-constant coupling as first-class hits: a newfind_shared_constants(symbol)agent tool walks the symbol graph plus a lightweight constant-use index (maintained by a tree-sitter visitor that flagsconst/static const/final/Parameterdeclarations), and returns every file that depends on a specific named value — so “before you changeSAMPLE_RATE, here are the 12 files that use it” becomes a pre-edit check the agent runs automatically when edit_file targets a file matchingpreserveRegex. Cross-invariant validation at completion-gate time: a new post-turn hook extracts numeric literals and named constants from everywrite_file/edit_filethe turn produced, cross-references them against the invariant set, and flags divergence (“MU_0declared as1.257e-6infields.pyline 12 but1.256e-6inwaves.pyline 38 — which is correct?”). Guards against the class of physics/math bug where two “agreeing” files silently disagree on the fourth decimal. 
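A rough sketch of the cross-invariant check: collect named numeric constants from the files a turn wrote and flag any name that carries more than one distinct value. The regular expression (upper-case names assigned a numeric literal) and all type and function names are assumptions for illustration; the spec describes a tree-sitter visitor, which would be more precise than line-based matching.

```typescript
interface WrittenFile { path: string; content: string; }
interface Declaration { name: string; value: string; path: string; line: number; }

// Assumed convention: UPPER_CASE name, '=', then a numeric literal (optionally in scientific notation).
const CONSTANT_ASSIGNMENT = /\b([A-Z][A-Z0-9_]*)\s*=\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)/;

function extractDeclarations(file: WrittenFile): Declaration[] {
  const found: Declaration[] = [];
  file.content.split('\n').forEach((text, index) => {
    const match = CONSTANT_ASSIGNMENT.exec(text);
    if (match) {
      found.push({ name: match[1], value: match[2], path: file.path, line: index + 1 });
    }
  });
  return found;
}

// Returns only the names whose declarations disagree numerically across the written files.
function findDivergence(files: WrittenFile[]): Map<string, Declaration[]> {
  const byName = new Map<string, Declaration[]>();
  files.forEach((file) =>
    extractDeclarations(file).forEach((decl) => {
      const list = byName.get(decl.name) ?? [];
      list.push(decl);
      byName.set(decl.name, list);
    })
  );
  const conflicts = new Map<string, Declaration[]>();
  byName.forEach((decls, name) => {
    const distinctValues = new Set(decls.map((d) => Number(d.value)));
    if (distinctValues.size > 1) conflicts.set(name, decls);
  });
  return conflicts;
}
```

Comparing parsed numbers rather than raw strings means `3e8` and `300000000` count as agreement, while `1.257e-6` versus `1.256e-6` is reported with both file paths and line numbers.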
Composes with every earlier retrieval entry: Project Knowledge Index is where the importance scores live + the Merkle layer reuses them as extra metadata for subtree selection; Memory Guardrails becomes “pin the profile’s invariant set by default” rather than manually picking constants; Semantic Time Travel answers “when didepsilon_0last change?” in O(diff) via Merkle; Multi-repo cross-talk checks for constant agreement ACROSS repos (sameGRAVITYvalue inplanetary_sim/andorbit_mechanics/? the tool flags the drift). Profile discovery:/profile suggestanalyzes the workspace (file extensions, import graph heuristics, presence of.tex/numpy/scipy/eigen), surfaces the top 1-3 matching built-in profiles with a one-click accept, writes the chosen profile(s) tosidecar.domainProfiles.active, and begins tracking. Output verbosity: the Retrieved Context block in the system prompt gains a “Preserved by domain profile” section tagged[profile: physics]so the model sees which lines are invariant-floor vs. standard retrieval hits. Configured viasidecar.domainProfiles.enabled(defaultfalse— opt-in per workspace; activating a profile auto-flips this),sidecar.domainProfiles.registryPath(default.sidecar/profiles/),sidecar.domainProfiles.active(string array — profiles compose in declared order, later profiles override earlier on conflict),sidecar.domainProfiles.autoDetect(defaulttrue— on first activation, run/profile suggestand prompt the user),sidecar.domainProfiles.reservedForInvariantsFloor(override floor applied to every active profile, default0= use each profile’s own value),sidecar.domainProfiles.crossInvariantValidation(defaulttrue), andsidecar.domainProfiles.sharedConstantsTool(defaulttrue). Pairs naturally with the v0.65-shipped graph-expanded retrieval (which this entry treats as the foundation) and the Project Knowledge Index’s Merkle + importance scoring. -
Project Knowledge Index — Symbol-Level Vectors + Graph Fusion in an On-Disk Vector DB — upgrades the shipped EmbeddingIndex (which today stores one 384-dim all-MiniLM-L6-v2vector per file in a flatFloat32Arrayat.sidecar/cache/embeddings.binwith a JSON metadata sidecar and a linear cosine scan at query time) into a Pro-grade codebase intelligence layer that stays entirely on disk, answers global questions, and models relationships — not just text matches. The gap this closes is best illustrated by the canonical repo-awareness question “where is the auth logic handled?”: the current flat index returns files whose text happens to mention “auth” somewhere, which usually means the middleware file is found but the routes that use it without saying “auth” are missed, and on a 10k-file repo the linear scan is slow enough to be noticeable. Copilot Pro answers this well because it indexes at symbol granularity and understands the call graph; this entry brings the same capability on-disk and local-first. Three layered changes: (1) Proper on-disk vector store via embedded LanceDB — a Rust-native columnar vector DB with a Node binding, HNSW indexes for sub-ms ANN over millions of vectors, metadata filtering (query “auth” only insidesrc/middleware/**), atomic writes, and zero external processes to manage. LanceDB is chosen over ChromaDB because Chroma’s Node support goes through a Python subprocess, which is a deployment footgun in a VS Code extension; LanceDB ships as a single.nodebinary with no runtime dependencies. Storage lives at.sidecar/cache/lance/(already covered by the gitignored-subdirs carve-out). (2) Symbol-level chunking replaces one-vector-per-file — every function, class, method, interface, and significant top-level comment block becomes its own indexed chunk. 
The existing symbolGraph.ts already runs tree-sitter over the workspace and knows symbol boundaries, so it becomes the chunker: eachSymbolNodeproduces one vector from its body text plus docstring, tagged with{ filePath, range, kind, name, containerSymbol, hash }. Granularity goes from thousands of file-vectors to hundreds of thousands of symbol-vectors; retrieval returns the specific function, not the whole file. (3) Graph-walk retrieval closes the “middleware vs routes” gap — after the initial vector hit, the retriever walks the symbol graph’s typed edges (defines,calls,imports,used-by) up tosidecar.projectKnowledge.graphWalkDepth(default2) and surfaces symbols reachable from the hit even when their text doesn’t match the query. So “where is auth handled?” retrievesrequireAuthmiddleware via vector similarity, then walks theused-byedges to return every route handler that wraps it — without those routes needing to say the word “auth.” The walk is budgeted (breadth-first up tomaxGraphHits, default10) so a popular symbol likelogger.infocan’t drown the result list. Incremental updates: VS Code’sonDidChangeTextDocument/onDidCreateFiles/onDidDeleteFiles/onDidRenameFilesevents drive re-embedding of only the changed symbols (not the whole file), resolved by content-hashing each symbol’s body — unchanged symbols keep their cached vectors so a one-line edit in a 2000-line file costs one re-embed, not 200. Rename events move the vector metadata instead of re-embedding. A background queue with 500ms debounce + 30s persist-to-disk matches the existing pattern at embeddingIndex.ts:24-25. New agent tool:project_knowledge_search(query, { maxHits?, graphWalkDepth?, kindFilter?, pathGlob? })returns structured{ symbol, filePath, range, score, relationship }[]withrelationshiptagging whether each hit was a direct vector match or reached via graph walk ("vector: 0.82","graph: used-by → 2 hops from requireAuth"), so the model sees why each result surfaced and can weight accordingly. 
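The budgeted graph walk in step (3) amounts to a breadth-first traversal capped by depth and by a hit budget. The sketch below assumes an adjacency-map representation of the symbol graph and a `GraphHit` shape; neither is taken from `symbolGraph.ts`, and the relationship string only imitates the format quoted above.

```typescript
type EdgeKind = 'defines' | 'calls' | 'imports' | 'used-by';
interface Edge { kind: EdgeKind; to: string; }
type SymbolGraph = Map<string, Edge[]>; // assumed adjacency-list representation
interface GraphHit { symbol: string; relationship: string; }

// Breadth-first walk from a vector hit, bounded by graphWalkDepth and maxGraphHits.
function graphWalk(graph: SymbolGraph, seed: string, graphWalkDepth = 2, maxGraphHits = 10): GraphHit[] {
  const hits: GraphHit[] = [];
  const visited = new Set<string>([seed]);
  let frontier: string[] = [seed];
  for (let depth = 1; depth <= graphWalkDepth && frontier.length > 0; depth++) {
    const next: string[] = [];
    for (const current of frontier) {
      for (const edge of graph.get(current) ?? []) {
        if (visited.has(edge.to)) continue;
        if (hits.length >= maxGraphHits) return hits; // budget spent: stop before a hub floods the list
        visited.add(edge.to);
        hits.push({ symbol: edge.to, relationship: `graph: ${edge.kind} → ${depth} hops from ${seed}` });
        next.push(edge.to);
      }
    }
    frontier = next;
  }
  return hits;
}
```

Because the traversal is breadth-first, the budget is spent on the nearest symbols first, so direct `used-by` route handlers are returned before anything two hops away.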
Migration from the flat index is transparent: on first activation with the new backend, the existing `.sidecar/cache/embeddings.bin` is read, re-chunked to symbol level, and ingested into LanceDB; the old file is kept for one version as a rollback safety net, then deleted. UI: a Project Knowledge sidebar panel shows index health (symbols indexed, last update time, vector count, disk footprint), a rebuild-from-scratch button for pathological cache states, and a search box that exposes the same `project_knowledge_search` tool for the user to query interactively. Composes with every earlier retrieval entry: `SemanticRetriever` in the fusion pipeline now queries symbols rather than files (hits are smaller and more precise, so they compete more fairly against doc and memory hits under RRF); Semantic Time Travel uses per-commit LanceDB snapshots at `.sidecar/cache/lance/history/<sha>/`; Memory Guardrails’ pinned entries go in the same store with a `pinned: true` metadata flag and a filter that always includes them regardless of score; the Semantic Agentic Search for Monorepos entry becomes “N LanceDB tables queried in parallel” (same code path, different roots). Configured via `sidecar.projectKnowledge.enabled` (default `true`), `sidecar.projectKnowledge.backend` (`lance` | `flat`, default `lance`; `flat` preserves the current behavior for users on constrained platforms where the native binding won’t load), `sidecar.projectKnowledge.chunking` (`symbol` | `file`, default `symbol`), `sidecar.projectKnowledge.graphWalkDepth` (default `2`), `sidecar.projectKnowledge.maxGraphHits` (default `10`), `sidecar.projectKnowledge.indexPath` (default `.sidecar/cache/lance/`), `sidecar.projectKnowledge.maxSymbolsPerFile` (default `500`, a guard against generated files with 50k symbols), and `sidecar.projectKnowledge.embedOnSave` (default `true`; set `false` for manual rebuild only).

    flowchart TD
        Q[Query: 'where is auth handled?'] --> E[Embed query<br/>all-MiniLM-L6-v2]
        E --> ANN[LanceDB HNSW search<br/>sub-ms ANN over<br/>symbol vectors]
        ANN --> V[Vector hits<br/>e.g. requireAuth middleware]
        V --> GW{Graph walk<br/>depth ≤ 2}
        GW -->|used-by edges| R1[Route handlers<br/>wrapping requireAuth]
        GW -->|calls edges| R2[Called helpers<br/>verifyToken, etc.]
        GW -->|imports edges| R3[Modules importing<br/>the middleware]
        V & R1 & R2 & R3 --> RANK[Rank + tag by<br/>relationship path]
        RANK --> OUT[Structured hits:<br/>symbol, filePath, range,<br/>score, relationship]
        subgraph Updates
            W[onDidChangeTextDocument] --> H[Hash changed symbols]
            H --> D{Diff vs cached}
            D -->|changed| RE[Re-embed only<br/>changed symbols]
            D -->|unchanged| SKIP[Keep cached vector]
            RE --> UP[Atomic upsert<br/>to LanceDB]
        end

-
Merkle-Addressed Semantic Fingerprint — Keystroke-Live Structural Index — layers a content-addressed Merkle tree over the Project Knowledge Index so change detection, integrity verification, and sync across sessions/machines become O(log n) instead of O(n), and re-embedding on a per-file save compresses to re-hashing on a per-keystroke basis with no latency cost. Current state honestly: EmbeddingIndex runs a 500ms debounced incremental update on onDidChangeTextDocument, re-embeds the whole file each time, persists as a flat binary every 30s. Works, but two things fall out of this: (a) large monorepos pay an index-walk cost for every query because there’s no hierarchy to prune with, and (b) “what changed since you were last here?” requires re-hashing everything because nothing is addressed structurally. This entry adds a Merkle layer that makes both of those sub-linear. The structure: every symbol-level chunk (already the granularity proposed in Project Knowledge Index) becomes a Merkle leaf with a content hashblake3(body ‖ path ‖ kind ‖ range)and its embedding as leaf metadata. Interior nodes aggregate their children’s hashes (blake3(child1 ‖ child2 ‖ …)) and also carry a mean-pooled aggregated embedding of their subtree, so the retriever can score whole subtrees at the interior level and skip them entirely without touching the leaves. The root hash is the repository’s semantic fingerprint — a single 32-byte string that changes iff any symbol in the workspace changed. Keystroke-live updates: VS Code’sonDidChangeTextDocumentfires on every edit with the modified ranges; the Merkle layer intercepts this and does the cheap work (re-hashing the containing symbol’s leaf, then the O(log n) parent chain up to the root) on every keystroke with no debounce — blake3 is fast enough that a 100-file-deep hash walk finishes in well under a millisecond. 
The expensive work (re-embedding) stays on a 300ms debounce because embedding is what actually takes ~20-50ms per chunk on-device — so the Merkle state is always current, the embedding state is eventually consistent within ~300ms, and the retriever can distinguish “this subtree is stale” (hash changed but embed hasn’t caught up — score with last-known embed, flag asstale: truefor honest UX) from “this subtree is fresh.” Where the latency comes from on a large monorepo — at query time the retriever walks down the tree: compute query embedding, compare against each of the root’s direct children’s aggregated embeddings, descend into the top-k subtrees, recurse. A workspace with 500k symbols becomes ~20 interior-level comparisons to narrow down to the top ~2k leaves, then an HNSW ANN search over those 2k (sub-ms in LanceDB). Total end-to-end latency: ~10–30ms on typical hardware even against a million-symbol index, which is the regime where “find that function three folders away” starts to feel instant rather than noticeable. Cache validity and sync become trivial byproducts of the root hash: on startup, SideCar recomputes the root over the current disk state (fast — just content hashes, no embeddings) and compares to the cached root; if they match, the whole index is reused as-is (no rebuild); if they differ, a tree walk finds exactly the changed subtrees and only those are re-embedded. The same mechanism gives cross-machine parity at trivial cost — Multi-User Agent Shadows’shadow.jsoncan include the Merkle root, so a teammate’s instance verifies index alignment in one 32-byte comparison and requests only the diff subtrees if misaligned. For Semantic Time Travel per-commit snapshots, unchanged subtrees dedup automatically (two commits that differ only insrc/utils/foo.tsshare every other subtree hash and therefore every other subtree’s cached embeddings) — a git-like compression ratio on the snapshot store without any custom encoding work. 
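The keystroke path (re-hash one leaf, then the parent chain up to the root, and mark the touched nodes stale until the debounced re-embed lands) can be sketched as below. The node shape and function names are hypothetical, and the hash is a small FNV-1a stand-in chosen only so the sketch has no dependencies; the spec calls for blake3 with sha256 as the fallback.

```typescript
type HashFn = (input: string) => string;

// FNV-1a (32-bit): a non-cryptographic stand-in for blake3, used here only to stay dependency-free.
const fnv1a: HashFn = (input) => {
  let h = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ('00000000' + h.toString(16)).slice(-8);
};

interface MerkleNode {
  id: string;
  hash: string;
  embeddingStale: boolean; // hash is current, embedding has not caught up yet
  parent: MerkleNode | null;
  children: MerkleNode[];
}

// Leaf hash over body ‖ path ‖ kind ‖ range, as in the structure described above.
function leafHash(hash: HashFn, body: string, path: string, kind: string, range: string): string {
  return hash([body, path, kind, range].join('\u0000'));
}

function makeLeaf(id: string, contentHash: string): MerkleNode {
  return { id, hash: contentHash, embeddingStale: false, parent: null, children: [] };
}

function makeInterior(id: string, children: MerkleNode[], hash: HashFn): MerkleNode {
  const node: MerkleNode = {
    id,
    hash: hash(children.map((c) => c.hash).join('\u0000')),
    embeddingStale: false,
    parent: null,
    children,
  };
  children.forEach((c) => { c.parent = node; });
  return node;
}

// Keystroke path: update one leaf, then re-hash only its ancestors. Returns ancestors touched.
function rehashLeaf(leaf: MerkleNode, newContentHash: string, hash: HashFn): number {
  leaf.hash = newContentHash;
  leaf.embeddingStale = true;
  let touched = 0;
  for (let node = leaf.parent; node !== null; node = node.parent) {
    node.hash = hash(node.children.map((c) => c.hash).join('\u0000'));
    node.embeddingStale = true;
    touched++;
  }
  return touched;
}
```

Sibling subtrees are never visited, which is what keeps the per-keystroke cost proportional to tree depth rather than to workspace size. The debounced re-embed would later clear `embeddingStale` along the same ancestor path.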
Lineage queries (/diff-since <commit-or-timestamp>) become a Merkle diff: two roots, descend into subtrees whose hashes differ, return the symbol-level changes — answerable in O(differences) rather than O(repo size), which is what makes “what changed since I was last here?” feel instant in sessions that span weeks. ~200-272k token context-window utilization: a frontier-model context window of this size is big enough to fit a small project outright, but for a 500k-symbol monorepo even 272k tokens is maybe 2% of the repo by token count, so the retriever’s job is to pick the 2% that matters. Merkle-addressed aggregated embeddings at interior nodes let the retriever select the most relevant subtrees first and materialize exactly as many as the context budget allows, with provably correct “you got the top-k subtrees for your budget” semantics rather than the current best-effort flat scan. Near-zero latency doesn’t come from precomputation alone — it comes from not having to walk most of the tree per query. Storage layout (.sidecar/cache/merkle/, covered by the gitignored-subdirs carve-out):tree.binfor the structure (parent/child pointers + hashes, mmapped),embeddings.lance/for leaf and interior-node vectors (the same LanceDB store from Project Knowledge Index, now with an extralevel: 0|1|2|…metadata column for interior-node rows),roots.logfor an append-only history of root hashes with timestamps so time-travel queries work without keeping full per-commit trees. Live root hash persists toroots.logeverysidecar.merkleIndex.rootSnapshotEveryMs(default10000, 10s) so a crash loses at most that interval of lineage data — the Merkle state itself is rebuildable from disk in ~seconds for any repo size. 
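The Merkle diff behind `/diff-since` can be illustrated with a recursive comparison that prunes any subtree whose hashes match. The `SnapshotNode` shape and the path-string output are assumptions for this sketch; the actual on-disk layout in `tree.bin` is not specified here.

```typescript
// Hypothetical immutable snapshot of a Merkle tree at one root.
interface SnapshotNode { name: string; hash: string; children: SnapshotNode[]; }

function childNamed(children: SnapshotNode[], name: string): SnapshotNode | undefined {
  for (const child of children) {
    if (child.name === name) return child;
  }
  return undefined;
}

// Returns the paths of leaves that were changed, added, or removed between two snapshots.
// Subtrees with equal hashes are skipped without being visited.
function merkleDiff(before: SnapshotNode | undefined, after: SnapshotNode | undefined, prefix = ''): string[] {
  if (before !== undefined && after !== undefined && before.hash === after.hash) return [];
  const node = after ?? before;
  if (node === undefined) return [];
  const path = prefix + node.name;
  const beforeChildren = before ? before.children : [];
  const afterChildren = after ? after.children : [];
  if (beforeChildren.length === 0 && afterChildren.length === 0) return [path];
  const names: string[] = [];
  beforeChildren.concat(afterChildren).forEach((child) => {
    if (names.indexOf(child.name) === -1) names.push(child.name);
  });
  let changed: string[] = [];
  names.forEach((name) => {
    changed = changed.concat(
      merkleDiff(childNamed(beforeChildren, name), childNamed(afterChildren, name), path + '/')
    );
  });
  return changed;
}
```

The early return on equal hashes is the whole trick: work is done only along paths that lead to a difference, which is where the O(differences) behavior described above comes from.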
Integration with every earlier entry: Project Knowledge Index becomes the similarity layer and Merkle becomes the addressing layer (they compose: Merkle narrows candidate subtrees, LanceDB HNSW ranks within them); Semantic Time Travel stores per-commit roots instead of per-commit full indexes (dedup-heavy; a 500-commit history costs roughly the same as 10 if the churn is low); Multi-User Agent Shadows syncs Merkle roots in `shadow.json` for team index parity; Fork & Parallel Solve shows the root diff between forks as a structural summary of “what did each fork actually change” alongside the file diff; Model Routing can gate on change velocity (symbols under a high-churn subtree escalate to a more thorough model); Regression Guards can be targeted by subtree (a physics guard only fires when the touched symbols’ Merkle path contains `src/physics/**`). Configured via `sidecar.merkleIndex.enabled` (default `true` when Project Knowledge is enabled, since the two are architecturally coupled), `sidecar.merkleIndex.hashAlgorithm` (`blake3` default for speed, `sha256` fallback for environments without a blake3 binding), `sidecar.merkleIndex.liveUpdates` (default `true`, which hashes on keystroke; set `false` to match the current 500ms-debounce behavior), `sidecar.merkleIndex.rootSnapshotEveryMs` (default `10000`), `sidecar.merkleIndex.aggregationStrategy` (`mean-pool` | `max-pool` | `attention-pool`, default `mean-pool`; `attention-pool` is future work needing a trained head, and mean-pool is the boring-and-correct default), and `sidecar.merkleIndex.maxSymbolsForLiveHash` (default `50000`; above this, fall back to debounced updates even in live mode, because keystroke-rate hashing of a 500k-leaf tree becomes non-trivial even at blake3 speeds).

    flowchart TD
        subgraph Tree ["Merkle tree of symbols"]
            R["Root hash<br/>blake3 + mean-pooled<br/>aggregated embedding"]
            R --> S1["Subtree src/<br/>hash + agg-embed"]
            R --> S2["Subtree tests/<br/>hash + agg-embed"]
            S1 --> F1["File hash<br/>agg of symbols"]
            S1 --> F2["File hash<br/>agg of symbols"]
            F1 --> L1["Leaf: function authN<br/>content hash +<br/>384-dim embedding"]
            F1 --> L2["Leaf: class AuthMiddleware<br/>..."]
        end
        K[User keystroke] --> UL[Re-hash modified leaf]
        UL --> UP["Walk up O(log n)<br/>update ancestor hashes"]
        UP --> R
        UL -.debounced 300ms.-> EM[Re-embed leaf]
        EM --> AGG[Recompute aggregated<br/>embeddings on ancestor path]
        subgraph Query ["Query path"]
            Q[Query embedding] --> QR[Compare vs root's<br/>direct children]
            QR --> DESC[Descend top-k subtrees]
            DESC --> HNSW[HNSW ANN<br/>over narrowed leaves]
            HNSW --> HITS[Ranked symbol hits]
        end
Editing & Code Quality
- Inline edit enhancement: ghost text overlay for single-cursor edits shipped in v1.0 (`src/edits/inlineEditProvider.ts`). Remaining: extend to `write_file` multi-file edits, per-hunk syntax-highlighted diff, and batched edit streams. Deferred post-v1.0.
- Selective regeneration: a “pin and regen” UI that locks good sections and regenerates only the unlocked portions.
-
Multi-File Edit Streams: DAG-Dispatched Parallel Writes. Closes the Copilot-free-vs-Pro gap on wide refactors by letting the agent stream changes across N files at once instead of serializing them one at a time. The current loop already batches multiple `tool_use` blocks within a single assistant turn (the model can emit `write_file src/a.ts` + `write_file src/b.ts` in one message and `executeToolUses` dispatches them together), but two gaps stop this from feeling like Pro-grade multi-file editing: (1) the agent rarely plans a multi-file edit up front: it tends to edit one file, wait to see the result, then decide the next edit, which serializes execution even when the edits are logically independent; and (2) the UI streams one diff preview at a time rather than N in parallel, so even batched writes feel sequential to the user. This entry addresses both. Up-front edit planning: when a task is large enough (`sidecar.multiFileEdits.minFilesForPlan`, default `3`), the loop inserts a mandatory Edit Plan pass before any `write_file` fires. The planner agent produces a typed manifest, `EditPlan { edits: { path, op: 'create' | 'edit' | 'delete', rationale, dependsOn: path[] }[] }`, and the runtime builds a DAG from the `dependsOn` edges. Independent nodes run in parallel up to `sidecar.multiFileEdits.maxParallel` (default `8`); edits with dependencies wait for their prerequisites (for example, renaming a symbol’s definition before editing the call sites). The plan surfaces in the chat UI as a collapsible Planned edits card that the user can inspect, and amend via Steer Queue nudges like “skip src/legacy/, I’ll do those manually”, before execution starts, so the scope is transparent up front instead of discovered one file at a time. Parallel streaming diff previews: the webview’s existing `streamingDiffPreviewFn` path is extended to handle N concurrent streams.
A Pending Changes panel tile renders per in-flight file with its own live diff, chars-streamed progress bar, and per-file abort button; on an 8-wide edit the user sees all eight files populate simultaneously rather than watching them tick through one by one. **Conflict detection at plan time, not write time**: the DAG builder catches plans with two `edit` ops targeting the same file (they are merged into one op with a combined rationale) or with circular dependencies (the planner is asked to revise once; a second failure is surfaced as an error). **Atomic review semantics**: by default, the Pending Changes panel treats a multi-file plan as a single unit of work. Accepting one file without the others can leave the codebase in a broken intermediate state (renamed definition + unrenamed call sites), so the default is accept-all or reject-all. Two escapes: `sidecar.multiFileEdits.reviewGranularity` set to `per-file` exposes individual file checkboxes for advanced users who want surgical control, and `per-hunk` drops down to hunk-level even across files. **Integration with every earlier feature**: all N streams land in the Shadow Workspace, so the main tree sees only the final bulk merge regardless of how many files are in flight; Regression Guards fire once against the full edit set rather than per-file, which is often what the user actually wants (a guard that only makes sense after the whole rename is done shouldn’t fail N−1 times during intermediate states); Audit Mode’s treeview shows N parallel buffered writes with per-file checkboxes matching the same granularity setting; Fork & Parallel Solve lets each fork contain its own multi-file plan for side-by-side comparison of wide-refactor strategies; Skills 2.0 can cap multi-file fanout per skill (a narrow `test_author` skill might set `max-parallel-edits: 1` in its tool-budget).
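The plan-time DAG handling described above can be sketched roughly as follows. This is a minimal illustration, not the shipped scheduler: the type names, `mergeDuplicateEdits`, and `scheduleBatches` are hypothetical, dependencies on files outside the plan are simply ignored, and ready nodes are grouped into fixed batches rather than fed to a rolling worker pool.

```typescript
// Hypothetical types modeled on the EditPlan manifest described above.
type EditOp = "create" | "edit" | "delete";

interface PlannedEdit {
  path: string;
  op: EditOp;
  rationale: string;
  dependsOn: string[];
}

interface EditPlan {
  edits: PlannedEdit[];
}

// Plan-time conflict handling: two `edit` ops on one path merge into a single
// node with a combined rationale; any other duplicate is treated as a conflict.
function mergeDuplicateEdits(plan: EditPlan): PlannedEdit[] {
  const byPath = new Map<string, PlannedEdit>();
  for (const e of plan.edits) {
    const prev = byPath.get(e.path);
    if (prev === undefined) {
      byPath.set(e.path, { ...e, dependsOn: [...e.dependsOn] });
    } else if (prev.op === "edit" && e.op === "edit") {
      prev.rationale = `${prev.rationale}; ${e.rationale}`;
      prev.dependsOn = Array.from(new Set([...prev.dependsOn, ...e.dependsOn]));
    } else {
      throw new Error(`Conflicting ops for ${e.path}: ${prev.op} vs ${e.op}`);
    }
  }
  return Array.from(byPath.values());
}

// Kahn-style layering: each pass collects the edits whose prerequisites have
// all landed, then splits them into batches no wider than maxParallel.
function scheduleBatches(plan: EditPlan, maxParallel = 8): string[][] {
  const width = Math.max(1, maxParallel);
  const nodes = mergeDuplicateEdits(plan);
  const inPlan = new Set(nodes.map((n) => n.path));
  // Unmet prerequisites per path; dependencies outside the plan are ignored
  // (an assumption of this sketch).
  const unmet = new Map<string, Set<string>>();
  for (const n of nodes) {
    unmet.set(n.path, new Set(n.dependsOn.filter((d) => inPlan.has(d))));
  }
  const batches: string[][] = [];
  while (unmet.size > 0) {
    const ready = Array.from(unmet.keys()).filter((p) => unmet.get(p)!.size === 0);
    if (ready.length === 0) {
      // Nothing can run, so the remaining nodes form at least one cycle.
      throw new Error(`Circular dependency among: ${Array.from(unmet.keys()).join(", ")}`);
    }
    for (const p of ready) unmet.delete(p);
    unmet.forEach((deps) => ready.forEach((p) => deps.delete(p)));
    for (let i = 0; i < ready.length; i += width) {
      batches.push(ready.slice(i, i + width));
    }
  }
  return batches;
}
```

A thrown cycle error is where a real runtime would ask the planner to revise once before surfacing the failure to the user.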
**Planning-pass cost**: the planner adds one extra LLM turn before edits start, so the feature can be opted out of when the user knows better (an `@no-plan` sentinel in the prompt skips the planner), and the planner can use a small local model via `sidecar.multiFileEdits.plannerModel` (default falls back to the main model) since planning is structured-output-heavy and doesn’t need the full reasoning budget of the editing model. Configured via `sidecar.multiFileEdits.enabled` (default `true`), `sidecar.multiFileEdits.maxParallel` (default `8`), `sidecar.multiFileEdits.planningPass` (default `true`), `sidecar.multiFileEdits.minFilesForPlan` (default `3`; skips the planner for small edits), `sidecar.multiFileEdits.plannerModel` (default empty; reuses the main model), and `sidecar.multiFileEdits.reviewGranularity` (`bulk` | `per-file` | `per-hunk`, default `per-file`).

    flowchart TD
        U[User task<br/>span > 3 files] --> PL[Edit Plan pass<br/>planner model]
        PL --> PLAN[EditPlan manifest<br/>edits + dependsOn DAG]
        PLAN --> CARD[Planned edits card<br/>in chat UI]
        CARD -->|User nudge via Steer Queue| PL
        CARD -->|OK to proceed| DAG[Topological schedule]
        DAG --> PAR[Dispatch independent nodes<br/>up to maxParallel]
        PAR --> S1[write_file src/a.ts]
        PAR --> S2[write_file src/b.ts]
        PAR --> SN[...up to 8 streams]
        S1 & S2 & SN --> PC[Pending Changes panel<br/>N parallel diff previews]
        PC --> DEP[Dependent nodes fire<br/>after prereqs land]
        DEP --> PC
        PC --> GATE{Gate + Guards<br/>against full edit set}
        GATE -->|green| REV[Review: bulk /<br/>per-file / per-hunk]
        GATE -->|red| FB[Feedback to agent<br/>+ refine plan]
        REV --> M[Atomic merge to shadow]

-
Zero-Latency Local Autocomplete via Speculative Decoding: pairs a small “draft” model (around 0.5B to 1.5B params, e.g. `qwen2.5-coder:0.5b`, `deepseek-coder:1.3b-distill`, or the new generation of sub-1B code drafts) with the user’s main FIM model (typically 7B–30B) and runs speculative decoding on the two in lockstep, amortizing the cost of the big model’s forward pass across k draft tokens per step. The existing `completeFIM` path at client.ts:286 and `InlineCompletionProvider` at completions/provider.ts:79 stream the result straight into VS Code’s ghost-text surface; today this runs the big model alone and inherits its raw tok/s. With a well-matched draft pair on decent local hardware (RTX 4090 / M3 Max / 128GB+ unified memory), empirically observed speedups are 2–4× on code continuations where the draft’s guesses agree with the target most of the time, pushing a 30B coder from ~30 tok/s to ~80–120 tok/s, which crosses the perception threshold from “noticeably waiting” to “appearing as you type.” **Target UX**: autocomplete that feels like Copilot / Cursor Pro without the round-trip to a cloud provider and without ongoing token spend. **Mechanism**: the draft generates k candidate tokens serially (cheap, because the small model needs only a few milliseconds per token); the target verifies all k in a single parallel forward pass (one big-model step covers up to k tokens of throughput); the longest prefix where the target’s argmax matches the draft’s proposal is accepted, the target’s token is used at the first disagreement, and the rest of the draft is discarded. A rejection-sampled variant is supported for temperature > 0, but the default is greedy since autocomplete wants determinism. **Backend integration**: Ollama and Kickstand both back onto llama.cpp, which has native speculative decoding support (`--model-draft`, `--draft` parameters); the path is to surface this through the backend abstraction as a new optional `draftModel` field on `SideCarConfig`, have `OllamaBackend.completeFIM` pass `draft_model` to `/api/generate` when set, and have `KickstandBackend.completeFIM` pass the equivalent to its OAI-compat endpoint.
For backends that don’t expose speculative decoding (Anthropic, OpenAI, remote OpenAI-compatible endpoints that haven’t enabled it), the setting is a silent no-op and completion runs target-only: no breakage, no warnings. **Model pairing**: a curated `DRAFT_MODEL_MAP` ships with sensible defaults (`qwen3-coder:30b` → `qwen2.5-coder:0.5b`, `deepseek-coder:33b` → `deepseek-coder:1.3b-base`, `codellama:34b` → `codellama:7b-code`) so users who just select a big model from the picker get the speedup automatically if the draft is installed, with a one-click “install recommended draft” affordance if not. Tokenizer compatibility is a hard requirement (same family, same vocab): the map only pairs models known to share tokenizers, and manual overrides that violate this are rejected with a specific error rather than producing garbled output. **VRAM guardrails**: running two models costs memory, so the feature integrates with the GPU-Aware Load Balancing roadmap entry; if VRAM headroom drops below the threshold while a big training job is running, speculative mode auto-disables and falls back to target-only rather than crashing. **FIM prompt format** carries through unchanged: the existing `<|fim_prefix|>` / `<|fim_suffix|>` / `<|fim_middle|>` delimiters are respected by both models in a matched pair.
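As a rough illustration of the greedy verification step described in the mechanism above, the sketch below works on plain token-id arrays. The function name `verifyGreedy` and the `VerifyResult` shape are hypothetical; in practice the backend (llama.cpp) performs this step internally, and the rejection-sampled variant for temperature > 0 is not shown.

```typescript
// Result of one speculative step (hypothetical shape, for illustration only).
interface VerifyResult {
  emitted: number[]; // tokens that can be streamed to the editor this step
  accepted: number;  // how many draft tokens the target agreed with
}

// draft[i] is the i-th token proposed by the draft model; targetArgmax[i] is the
// target model's top-scoring token at that same position, taken from the single
// parallel forward pass over the k draft tokens.
function verifyGreedy(draft: number[], targetArgmax: number[]): VerifyResult {
  if (targetArgmax.length < draft.length) {
    throw new Error("target must score every draft position");
  }
  const emitted: number[] = [];
  for (let i = 0; i < draft.length; i++) {
    if (draft[i] === targetArgmax[i]) {
      emitted.push(draft[i]); // agreement: keep the draft token
      continue;
    }
    // First disagreement: take the target's token and discard the rest of the draft.
    emitted.push(targetArgmax[i]);
    return { emitted, accepted: i };
  }
  return { emitted, accepted: draft.length };
}
```

Because every emitted token is either confirmed or supplied by the target, the greedy output matches what the target model alone would have produced; the draft only changes how many tokens each target step yields.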
Configured via `sidecar.speculativeDecoding.enabled` (default `true` when a draft mapping exists for the active model, `false` otherwise; zero-config for the common case), `sidecar.completionDraftModel` (explicit override, falls back to the curated map), `sidecar.speculativeDecoding.lookahead` (default `5`; number of draft tokens per verification step; higher = more speedup when the draft is accurate, lower = less wasted compute when the draft is wrong), `sidecar.speculativeDecoding.temperature` (default `0`, greedy; raise for rejection-sampled generation if autocomplete suggestions feel stale), and `sidecar.speculativeDecoding.minAcceptRateToKeepEnabled` (default `0.4`; if the observed accept rate drops below this after a warmup window, speculation is disabled automatically because the draft isn’t earning its keep and is just burning compute).

    sequenceDiagram
        participant E as Editor (ghost text)
        participant P as InlineCompletionProvider
        participant D as Draft model (0.5B)
        participant T as Target model (30B)
        E->>P: completion trigger (debounced)
        P->>P: build FIM prompt (prefix + suffix)
        loop Speculative step
            P->>D: generate k tokens (fast, serial)
            D-->>P: [t1, t2, ..., tk]
            P->>T: verify [t1..tk] in one parallel forward pass
            T-->>P: logits for each position
            P->>P: accept longest matching prefix, replace first mismatch
        end
        P-->>E: stream accepted tokens as ghost text
        Note over P: Typical accept rate 60-80%<br/>→ 2-4× throughput vs target alone
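The `minAcceptRateToKeepEnabled` behavior could be tracked with a small counter like the one sketched here. The class name, method names, and the 200-token warmup default are assumptions made for illustration; the spec only says "after a warmup window" and does not fix its size.

```typescript
// Tracks how many draft tokens were proposed vs. accepted and decides whether
// speculation is still worth running. All names here are hypothetical.
class AcceptRateMonitor {
  private proposed = 0;
  private accepted = 0;
  private readonly minAcceptRate: number;
  private readonly warmupTokens: number;

  // warmupTokens is an assumed parameter: the number of proposed draft tokens
  // to observe before the threshold is enforced.
  constructor(minAcceptRate = 0.4, warmupTokens = 200) {
    this.minAcceptRate = minAcceptRate;
    this.warmupTokens = warmupTokens;
  }

  // Call once per speculative step with k and the accepted-prefix length.
  record(proposedTokens: number, acceptedTokens: number): void {
    this.proposed += proposedTokens;
    this.accepted += acceptedTokens;
  }

  acceptRate(): number {
    return this.proposed === 0 ? 1 : this.accepted / this.proposed;
  }

  // During warmup speculation always stays on; afterwards it stays on only
  // while the observed accept rate meets the configured minimum.
  shouldKeepEnabled(): boolean {
    if (this.proposed < this.warmupTokens) return true;
    return this.acceptRate() >= this.minAcceptRate;
  }
}
```

A cumulative rate is the simplest choice; a sliding window would react faster when the user switches to a file type the draft model handles poorly.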
Agent Capabilities
- Chat threads and branching — parallel branches, named threads, thread picker, per-thread persistence
-
Persistent executive function: multi-day task state in `.sidecar/plans/`, tracking progress, decisions, and blockers across sessions
- First-Class Skills 2.0: Typed Personas with Tool Allowlists, Preferred Models, and Composition. Upgrades the shipped SkillLoader from “inject markdown into the prompt” into a full persona system where each `.agent.md` (or existing `.md`) skill is a declarative contract the runtime actually enforces. The parser at skillLoader.ts:54 already reads, but silently ignores, Claude-Code-compatible frontmatter fields (`allowed-tools`, `disable-model-invocation`); this entry makes every one of those fields load-bearing and adds several more. **Enforced frontmatter schema**:
```yaml
---
name: Git Expert
description: Focused git workflow assistance
scope: session                      # turn | task | session: how long the skill stays active
allowed-tools: [git_status, git_diff, git_log, git_commit, git_branch, git_push, read_file]
preferred-model: claude-sonnet-4-6  # switch to this model while active; restore on exit
system-prompt-override: false       # false = append to base prompt, true = replace it entirely
disable-model-invocation: false     # when true, only the user can invoke; model can't auto-select
extends: base-coder                 # inherit frontmatter + prompt from another skill
variables:                          # user-supplied args at invocation
  branch: { description: Target branch, required: false }
  message: { description: Commit message, required: false }
auto-context:                       # auto-inject these tool calls' output as starting context
  - git_status
  - git log -n 10
guards: [branch-protection]         # Regression Guards that activate with this skill
tool-budget:                        # per-tool call caps while this skill is active
  git_commit: 3
---
```
Each field maps to a concrete runtime behavior:
  - `allowed-tools` intersects with the current `toolPermissions` map (most restrictive wins) so `/git_expert` literally cannot call `write_file` or `run_shell_command` regardless of the ambient mode. This is the principle of least privilege per skill, turning a `db-writer` skill into a real capability boundary and not just an advisory one.
  - `preferred-model` triggers a scoped `updateModel()` swap for the skill’s duration; on exit the previous model restores (exceptions revert too, no sticky-state bugs).
  - `system-prompt-override: true` fully replaces the base prompt with the skill’s content for the hardest personality lock, useful when you want `latex_writer` to be a LaTeX-only assistant with no inherited general-coder instincts; the default `false` keeps the existing append-as-context behavior for backward compatibility.
  - `disable-model-invocation` prevents injection-style skill abuse where a hostile file could prompt the model into silently activating a privileged skill; the skill is user-invocation-only.
  - `extends` gives single-inheritance composition: `frontend.agent.md` extends `base-coder` and inherits its tool allowlist + prompt preamble, overriding or extending per-field.
  - `variables` are resolved at invocation (`/git_expert branch=feature/foo`) and substituted into the prompt as `${branch}`; Claude Code’s `$ARGUMENTS` convention is also accepted as an alias.
  - `auto-context` runs a fixed set of read-only tool calls before the skill’s first turn so the model sees pre-fetched state (the `git_expert` skill always starts with current `git status` + last 10 commits in its context, no wasteful first-turn `git_status` call).
  - `guards` registers per-skill Regression Guards that activate only while the skill is in effect.
  - `tool-budget` caps per-tool calls (prevents a runaway skill from calling `git_commit` 50 times).

**Skill stacking**: users can invoke multiple skills simultaneously via `/with git_expert /with technical-writer <task>` or a persistent stack via the UI picker.
Tool allowlists intersect (`git_expert ∩ technical-writer` = only tools both permit); preferred-model conflicts resolve by last-invoked-wins with a visible indicator; prompts concatenate in stack order with section headers so the model sees the layered persona clearly. **Scopes**: `turn` skills apply for exactly one user turn and revert; `task` skills persist until the current task’s completion gate passes; `session` skills persist until explicitly ended with `/unload <skill>` or a new session starts, which matches the mental model users already have from similar systems. **Skills Picker UI**: a new sidebar panel replaces “type the slash command and hope you remember the name” with a searchable grid of available skills, tagged by category (git / frontend / security / scientific / writing), with a preview of the persona’s opening instructions, the tool allowlist rendered as chips, and a Stack button to add without replacing. **Telemetry (local-only, opt-in)**: per-skill usage count, average turns-to-completion, and accept rate of the skill’s proposed changes, surfaced in the picker so users can see which skills are earning their keep and which are dead weight. **Integration with every earlier entry**: Facets consume skills via their existing `skillBundle` field (a facet stacks its declared skills automatically on dispatch); Fork & Parallel Solve can wear different skills per fork (fork A with `fourier_approach.agent.md`, fork B with `wavelet_approach.agent.md`); Regression Guards declared in skill frontmatter fire only while the skill is active; Audit Mode can be required by a skill (`require-audit: true`) for write-heavy skills; Visual Verification criteria can be declared per-skill. **Backward compatibility**: every field is optional. The 8 shipped skills (`break-this`, `create-skill`, `debug`, `explain-code`, `mcp-builder`, `refactor`, `review-code`, `write-tests`) keep working unchanged since they declare none of the new fields; missing fields default to the current permissive behavior (full tool access, append-mode prompt, turn-scoped).
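The `variables` substitution described above might look roughly like this. It is a sketch under stated assumptions: `parseInvocation` and `renderPrompt` are hypothetical names, arguments are split naively on whitespace (so values containing spaces are not handled), unset optional variables render as an empty string, and `$ARGUMENTS` is taken to mean the raw argument string.

```typescript
// Mirrors one entry under `variables:` in the frontmatter example.
interface VariableSpec {
  description: string;
  required: boolean;
}

interface Invocation {
  skill: string;
  args: Record<string, string>;
  raw: string; // everything after the skill name, used for $ARGUMENTS
}

// Parses "/git_expert branch=feature/foo" into a skill name plus key=value args.
function parseInvocation(input: string): Invocation {
  const trimmed = input.trim();
  if (trimmed.charAt(0) !== "/") throw new Error("not a skill invocation");
  const parts = trimmed.slice(1).split(/\s+/);
  const skill = parts[0] ?? "";
  const rest = parts.slice(1);
  const args: Record<string, string> = {};
  for (const token of rest) {
    const eq = token.indexOf("=");
    if (eq > 0) args[token.slice(0, eq)] = token.slice(eq + 1);
  }
  return { skill, args, raw: rest.join(" ") };
}

// Substitutes ${name} placeholders and the $ARGUMENTS alias in a single pass,
// after checking that every required variable was supplied.
function renderPrompt(
  template: string,
  specs: Record<string, VariableSpec>,
  invocation: Invocation
): string {
  for (const name of Object.keys(specs)) {
    if (specs[name].required && !(name in invocation.args)) {
      throw new Error(`missing required variable: ${name}`);
    }
  }
  return template.replace(
    /\$\{(\w+)\}|\$ARGUMENTS/g,
    (_match: string, name?: string) => {
      if (name === undefined) return invocation.raw; // matched $ARGUMENTS
      return name in invocation.args ? invocation.args[name] : "";
    }
  );
}
```

Doing both substitutions in one regular-expression pass avoids re-expanding placeholders that happen to appear inside a user-supplied value.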
Configured via `sidecar.skills.directories` (already exists; extended to accept both `.md` and `.agent.md`), `sidecar.skills.enforceAllowedTools` (default `true`; `false` for legacy “advisory only” parsing), `sidecar.skills.allowModelInvocation` (default `true`; when `false` only user-initiated invocation is ever honored, even for skills that don’t declare `disable-model-invocation`), and `sidecar.skills.stackingMode` (`strict` | `union` | `last-wins`, default `strict`; strict intersects tool allowlists, union takes the superset, last-wins replaces prior skills entirely).
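A minimal sketch of how the three `stackingMode` values could combine allowlists follows. The function name `effectiveTools` is hypothetical, and two behaviors are assumptions of this sketch rather than stated in the spec: a skill that declares no `allowed-tools` is treated as allowing everything the ambient permissions allow, and the ambient `toolPermissions` set always bounds the result, even in `union` mode.

```typescript
type StackingMode = "strict" | "union" | "last-wins";

// ambient: tools permitted by the current toolPermissions map.
// stack: allowlists of the active skills in invocation order; `undefined`
// stands for a skill that declares no allowed-tools.
function effectiveTools(
  ambient: string[],
  stack: Array<string[] | undefined>,
  mode: StackingMode
): string[] {
  if (stack.length === 0) return [...ambient];
  const lists = stack.map((allow) => allow ?? ambient);
  let combined: string[];
  if (mode === "last-wins") {
    // Only the most recently invoked skill counts.
    combined = lists[lists.length - 1] ?? ambient;
  } else if (mode === "union") {
    // Superset of every skill's allowlist.
    combined = lists.reduce<string[]>((acc, l) => acc.concat(l), []);
  } else {
    // strict: a tool must appear in every skill's allowlist.
    combined = lists.reduce((acc, l) => acc.filter((t) => l.indexOf(t) !== -1));
  }
  // Most restrictive wins: never exceed the ambient permissions.
  return ambient.filter((t) => combined.indexOf(t) !== -1);
}
```

With the `git_expert` and `technical-writer` example, `strict` leaves only the tools both skills permit, which is the intersection behavior described for the default mode.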

    flowchart TD
        U[User invokes /git_expert] --> L[SkillLoader resolves +<br/>merges extended skills]
        L --> FM{Frontmatter fields}
        FM --> AT[allowed-tools →<br/>intersect with toolPermissions]
        FM --> PM[preferred-model →<br/>scoped updateModel]
        FM --> SP[system-prompt-override →<br/>replace or append]
        FM --> V[variables → substitute<br/>user args into prompt]
        FM --> AC[auto-context →<br/>pre-fetch read-only tool output]
        FM --> G[guards → register on<br/>HookBus for skill lifetime]
        FM --> TB[tool-budget →<br/>per-skill call caps]
        AT & PM & SP & V & AC & G & TB --> ACT[Skill active]
        ACT --> SCOPE{scope}
        SCOPE -->|turn| T1[Revert after 1 turn]
        SCOPE -->|task| T2[Revert when gate<br/>closes cleanly]
        SCOPE -->|session| T3[Revert on /unload<br/>or session end]
        T1 & T2 & T3 --> REV[Restore prior model,<br/>tool perms, hooks]

-
Skill Sync & Registry: Git-Native Distribution Across Machines and Projects. Extends Skills 2.0 from “manually drop `.agent.md` files in each project’s `.sidecar/skills/` or `~/.claude/commands/`” to a proper three-tier distribution model matching Copilot Pro / Cursor’s global agent registry, but git-native and local-first so no SideCar-operated service stands between you and your skills. The three tiers, from smallest blast radius to largest, are already partially supported or genuinely new: (1) **Project-level team sync** is already solved. Per the Multi-User Agent Shadows `.gitignore` carve-out, `.sidecar/skills/` at the project root stays tracked in git; teams that commit skills there get cross-developer sync for free via the main repo’s history. No new feature is needed at this tier, but this entry documents it as first-class. (2) **User-level cross-machine sync** is the real gap: `~/.claude/commands/*.md` works on one machine, but moving to a second laptop or a new dev container means copying files by hand. SideCar gains `sidecar.skills.userRegistry`, a git URL (or a local folder) the user owns: on activation, SideCar clones or pulls that repo into `~/.sidecar/user-skills/`, the SkillLoader picks up every `.agent.md` inside as a user-scope skill, and the “Create Skill” flow offers a Publish to your registry checkbox that writes the new skill into the clone, commits, and pushes. Standard git auth (SSH keys, GitHub tokens) handles credentials, with no custom auth plumbing. A `sidecar.skills.autoPull` schedule (`on-start` | `hourly` | `daily` | `manual`, default `on-start`) keeps the clone fresh; conflicts surface as notifications pointing to the managed directory for manual merge rather than being silently swallowed.
(3) **Team-scoped additional registries** layer on top: `sidecar.skills.teamRegistries` accepts an array of git URLs, each cloned into a separate subdirectory of `~/.sidecar/team-skills/<registry-slug>/`, with the Skills Picker tagging hits by origin registry so a developer on three overlapping teams can see which registry each skill came from and resolve name collisions deterministically (explicit registry prefix: `/team-a/db-expert` vs `/team-b/db-expert`). (4) **Public marketplace** is an optional fourth tier: a lightweight hosted index at `registry.sidecar.ai` (or any compatible endpoint via `sidecar.skills.marketplace`) that crawls opted-in public git repos and exposes search / tags / author / install-count metadata; the Skills Picker’s Browse tab queries it at the user’s request. Installing from the marketplace still does a standard git clone into a managed location. The registry is just an index, not a runtime dependency, so if it goes down your installed skills keep working and future installs fall back to direct git URLs. **Skill metadata for distribution** extends the Skills 2.0 frontmatter with: `version: 1.2.0` (semver, for pinning and update notifications); `author: @user` (renders in the picker, links to their registry); `repository: https://github.com/user/skill-repo` (source-of-truth URL for updates); `license: MIT` (surfaced in the picker so users see the legal posture before invoking); `tags: [git, automation]` (for marketplace filtering); `requires: [@core/base-coder@^1.0]` (inter-skill deps resolved transitively at install time). **Versioning and pinning**: `sidecar.skills.versions` accepts a map of `{ "@user/skill-name": "1.2.0" }` pins; the Skills Picker shows an Update available badge when a newer version exists upstream but never auto-updates a pinned skill without the user’s explicit OK.
**Trust model is explicit**: `sidecar.skills.trustedRegistries` lists registries that install without prompting; any other registry (including first-use of the public marketplace) prompts on first install with “this skill will be allowed to suggest tool calls and prompt injections to your agent; review the source at?”, with the skill's full frontmatter + body shown inline. Skills still respect the `allowed-tools` and `disable-model-invocation` guardrails from Skills 2.0, so even an untrusted skill can't silently escalate beyond its declared tool surface: the trust prompt is about the *intent* of the skill's prose, not about bypassing runtime enforcement. **Offline is a first-class mode**: once a skill is cloned, it works without network, the registry API is optional at runtime, and `sidecar.skills.offline` (default `false`) hard-disables every network operation, so the extension becomes a pure local-cache reader, useful in air-gapped environments or in restrictive CI. **Integrates with every earlier feature**: Facets can reference skills via the same `@user/skill-name` identifier their `skillBundle` already uses, and the resolver fetches missing skills on first facet dispatch; Fork & Parallel Solve can pull different skill versions per fork (`fork A uses @core/refactor@1.0`, `fork B uses @core/refactor@2.0`, a direct A/B test of a skill upgrade against real code); Project Knowledge Index can embed installed skills into the vector DB so `project_knowledge_search "git workflow"` finds a relevant skill as a retrieval hit; the Typed Sub-Agent Facets entry's `skillBundle` field resolves through this system so a facet's skill dependencies are fetched deterministically on install.
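The install-time trust decision described above reduces to a small branch, sketched here. The names `decideInstall`, `SkillRegistrySettings`, and the three decision labels are hypothetical; comparing registry URLs after trimming trailing slashes is an assumption of this sketch, not something the spec defines.

```typescript
type InstallDecision = "blocked-offline" | "auto-install" | "prompt";

// The two settings that matter for this decision (hypothetical grouping).
interface SkillRegistrySettings {
  offline: boolean;            // sidecar.skills.offline
  trustedRegistries: string[]; // sidecar.skills.trustedRegistries
}

// Assumed normalization: ignore surrounding whitespace and trailing slashes.
function normalizeRegistryUrl(url: string): string {
  return url.trim().replace(/\/+$/, "");
}

function decideInstall(
  registryUrl: string,
  settings: SkillRegistrySettings
): InstallDecision {
  // Offline mode hard-disables every network operation, including installs.
  if (settings.offline) return "blocked-offline";
  const target = normalizeRegistryUrl(registryUrl);
  const trusted = settings.trustedRegistries.some(
    (u) => normalizeRegistryUrl(u) === target
  );
  // Anything not explicitly trusted goes through the first-install prompt.
  return trusted ? "auto-install" : "prompt";
}
```

Note that the default marketplace URL is not implicitly trusted in this sketch, which matches the statement that every marketplace install still passes through a trust prompt unless the user lists it.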
Configured via `sidecar.skills.userRegistry` (git URL or local folder, default empty; opt-in), `sidecar.skills.teamRegistries` (array of git URLs, default empty), `sidecar.skills.marketplace` (URL, default `https://registry.sidecar.ai` but every install still passes through a trust prompt), `sidecar.skills.autoPull` (default `on-start`), `sidecar.skills.autoUpdate` (`manual` | `weekly` | `daily`, default `weekly`; respects pins), `sidecar.skills.trustedRegistries` (array of registry URLs that skip the first-install trust prompt; empty by default), `sidecar.skills.versions` (pin map), and `sidecar.skills.offline` (default `false`; when `true`, no network calls at all).

    flowchart TD
        subgraph Tiers ["Distribution tiers"]
            T1[Project-level<br/>.sidecar/skills/<br/>tracked in repo<br/>ALREADY WORKS]
            T2[User-level<br/>userRegistry<br/>git clone to<br/>~/.sidecar/user-skills/]
            T3["Team-level<br/>teamRegistries[]<br/>per-registry subdirs"]
            T4[Public marketplace<br/>optional index<br/>still git under the hood]
        end
        A[SideCar activation] --> PULL{autoPull schedule}
        PULL --> T2
        PULL --> T3
        UI[Skills Picker<br/>Browse tab] --> MP[marketplace API]
        MP --> T4
        T1 & T2 & T3 & T4 --> SL[SkillLoader<br/>merges with conflict<br/>resolution by prefix]
        SL --> PICK[Unified picker<br/>tagged by origin]
        PICK --> INV[Skill invoked<br/>respects allowed-tools<br/>from Skills 2.0]
        subgraph Trust ["Trust on install"]
            INST[First install<br/>from new registry] --> PROMPT{trustedRegistries<br/>contains it?}
            PROMPT -->|yes| AUTO[Auto-install]
            PROMPT -->|no| MODAL[Show frontmatter +<br/>source link + Install button]
        end

-
LaTeX agentic debugging: intercepts compiler output (pdflatex / xelatex / lualatex / bibtex / biber) and closes the loop between the raw log and the source tree without the user ever reading a `.log` file. When a build fails, a dedicated log-parsing agent classifies each error by type (missing brace, undefined reference, BibTeX key mismatch, undefined control sequence, overfull hbox, missing `\end`, etc.), maps the reported line number back to the actual offending location accounting for `\input` / `\include` transclusion, and stages a targeted fix directly in the Pending Changes diff view, ready to accept with one click. Multi-error runs are handled in a single pass: the agent resolves errors in dependency order (e.g. fix the missing `}` before re-evaluating the downstream undefined-reference cascade) so the build converges in as few iterations as possible. BibTeX / Biber mismatches get special treatment: the agent cross-references the `.bib` file, the `.aux` citations, and the bibliography style to distinguish a missing entry from a key typo from a field-format violation, and proposes the minimal `.bib` edit. Configured via `sidecar.latex.enabled` (default `true` when a `.tex` file is open) and `sidecar.latex.buildCommand` (defaults to an auto-detected `latexmk` invocation). Surfaces in the chat UI as a LaTeX Build status-bar item that turns red on failure and opens the agent panel on click.
- Research Assistant: Structured Lab Notebook, Experiment Manifests, and Hypothesis Graph. Ties the research-adjacent primitives already scattered across this ROADMAP (Literature synthesis, Doc-to-Test Loop, Integrated LaTeX Preview, LaTeX agentic debugging, Visualization Dashboards, Browser-Agent visual verification) and the shipped domain skills (`technical-paper`, `mathematical-proofs`, `signal-processing`, `statistics`, `radar-fundamentals`, `electromagnetics`) into a cohesive lab-notebook workflow so SideCar stops being “a code assistant that happens to know LaTeX” and becomes “an end-to-end research collaborator that happens to also write code.” **The gap today**: a user running a simulation, collecting results, iterating on an algorithm, and drafting a paper has to hold all the connective tissue in their head: which experiment tested which hypothesis, which figure came from which data run, which citation supports which claim, which parameter sweep produced which plot. SideCar can help with any individual step but has no persistent model of the project as a research artifact. This entry introduces that model. **Research Projects as first-class entities** live under `.sidecar/research/<project-slug>/` (tracked in git; this is curated state, not ephemeral cache, so it stays out of the gitignored subdirs list) with a clean directory structure: `project.yaml` (top-level metadata: title, question, hypotheses list, status), `experiments/<exp-id>/manifest.yaml` (one per experiment with reproducibility fields, see below), `literature/` (symlinks or copies into the Literature synthesis index with project-specific notes overlaid), `figures/<fig-id>/` (source data + generation script + rendered outputs + captured seed), `drafts/` (paper sections, poster, slide decks), and `observations/<timestamp>.md` (timestamped free-form notes the agent and user both contribute to).
**Experiment Manifest schema**: every experiment is a reproducible, content-addressed unit:
```yaml
id: exp-2026-04-16-fir-comparison
hypothesis: "A wavelet-based decomposition outperforms FFT for detecting sub-cycle transients below -40 dB"
parameters:
  sample_rate_hz: 48000
  snr_db: [-40, -35, -30, -25, -20]   # sweep
  filter_order: 256
  seed: 42
environment:
  python: "3.11.7"
  requirements_hash: blake3:abc123…
  git_sha: def456…
  hardware: "M3 Max, 64GB unified"
command: "python experiments/fir_vs_wavelet.py --config exp-config.yaml"
artifacts:
  - results.parquet
  - figures/snr_vs_detection.png
  - logs/run.txt
interpretation: ""
supports: [hypothesis-id]             # hypothesis this experiment supports or refutes
refutes: []
related_work: ["@smith2024", "@jones2023"]
status: complete                      # planning | running | complete | abandoned
```
Running `/experiment run` dispatches the command inside a Shadow Workspace (so the main tree stays pristine), captures every artifact into `experiments/<exp-id>/`, and automatically populates `environment` from git state + `pip freeze` / `npm ls` / `cargo tree` + the current hardware probe (reuses the `system_monitor` tool from v0.57+). **Reproducibility is enforced, not advisory**: re-running a stored manifest fails loudly if the git SHA has drifted or the requirements hash doesn't match, with a "reproduce exactly" path that checks out the recorded SHA into a shadow and re-runs against pinned dependencies. This catches the researcher's-nightmare scenario of "I can't reproduce my own result from three weeks ago because `numpy` silently upgraded." **Hypothesis Graph** lives alongside the experiment store: nodes are hypotheses (each with a status of `open` / `supported` / `refuted` / `needs-more-evidence` / `abandoned`), and edges are `supports` / `refutes` / `depends-on` / `generalizes`, derived from the experiments' `supports` and `refutes` fields. It is rendered in a sidebar *Research Board* as a force-directed graph (via the Visualization Dashboards MCP layer once that ships, with a Mermaid fallback in the interim), showing which hypotheses have evidence piling up, which are contested (experiments both support *and* refute), and which are dangling (stated but never tested). The agent treats this graph as first-class context: "we have three experiments supporting H1 but H2 is untested and contradicts H1; should we run an experiment isolating them?" becomes a suggestion the agent can make, backed by the actual state of your research.
**New agent tools** layered onto the existing 23+ tool catalog: `run_experiment(manifest)` dispatches a recorded manifest and captures its artifacts; `log_observation(text, relatedTo: {experiment? | hypothesis? | figure?})` appends a timestamped observation to `observations/` with structured cross-references; `test_hypothesis(id)` aggregates evidence across linked experiments and returns a verdict with confidence (Bayesian posterior if priors are declared, otherwise a simple experiment-count ratio); `find_related_work(topic, depth)` walks the Literature graph (via the Literature synthesis index) up to N hops, surfacing papers the project doesn't yet cite but probably should; `suggest_next_experiment(hypothesis)` reasons over what would most reduce uncertainty given existing evidence (uses the Thinking Visualization `self-debate` mode so the user can see the reasoning); `validate_statistics(data, test, alpha)` runs sample-size / statistical power / effect-size / multiple-comparison checks via a bundled `statistics` skill-facet and blocks claiming a finding as "supported" until the checks pass; `generate_figure(data, spec, caption)` produces matplotlib / plotly / tikz output with captured seed + code + parameters, stored as a reproducible figure bundle; `draft_section(kind: 'abstract'|'intro'|'methods'|'results'|'discussion'|'related-work', sources)` produces a paper section grounded in the actual experiment manifests + literature graph, with every claim traced back to an experiment ID or citation (no unsupported claims survive the generation — composes with the RAG-Native Eval Metrics entry's faithfulness scorer). 
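The non-Bayesian fallback of `test_hypothesis` (the "simple experiment-count ratio") could be sketched as below. The mapping from counts to a verdict is an assumption made for this illustration, as is the choice to count only `complete` experiments; `testHypothesis`, `ManifestSummary`, and `HypothesisVerdict` are hypothetical names.

```typescript
type ExperimentStatus = "planning" | "running" | "complete" | "abandoned";

// The subset of manifest fields this sketch needs.
interface ManifestSummary {
  id: string;
  status: ExperimentStatus;
  supports: string[];
  refutes: string[];
}

type Verdict = "open" | "supported" | "refuted" | "needs-more-evidence";

interface HypothesisVerdict {
  verdict: Verdict;
  confidence: number;   // share of completed experiments on the majority side
  supporting: string[]; // experiment ids
  refuting: string[];   // experiment ids
}

function testHypothesis(
  hypothesisId: string,
  manifests: ManifestSummary[]
): HypothesisVerdict {
  // Assumption: only completed experiments count as evidence.
  const done = manifests.filter((m) => m.status === "complete");
  const supporting = done
    .filter((m) => m.supports.indexOf(hypothesisId) !== -1)
    .map((m) => m.id);
  const refuting = done
    .filter((m) => m.refutes.indexOf(hypothesisId) !== -1)
    .map((m) => m.id);
  const total = supporting.length + refuting.length;
  if (total === 0) {
    return { verdict: "open", confidence: 0, supporting, refuting };
  }
  const confidence = Math.max(supporting.length, refuting.length) / total;
  let verdict: Verdict;
  if (refuting.length === 0) verdict = "supported";
  else if (supporting.length === 0) verdict = "refuted";
  else verdict = "needs-more-evidence"; // contested: evidence on both sides
  return { verdict, confidence, supporting, refuting };
}
```

Returning the experiment ids alongside the verdict keeps the result traceable, in the same spirit as `draft_section` tracing every claim back to an experiment ID.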
**Reviewer simulation** — before the user shares a paper draft, `/review-as ` spawns a critic agent wearing a reviewer persona (`skeptical-reviewer`, `domain-expert-reviewer`, `methods-critic-reviewer` all shipped as built-in skills) that reads the draft + underlying experiment manifests and returns structured objections: statistical concerns, missing controls, unsupported claims, related-work gaps, reproducibility red flags. Reuses the existing War Room infrastructure but with research-specific rubrics baked into the critic personas. **Statistical validity as a Regression Guard** — the `validate_statistics` check can be registered as a `pre-completion` guard on the `draft_section` tool so a paper draft literally cannot be marked done if the underlying experiments don't clear statistical validity (under-powered n, p-hacking patterns in the parameter sweep, undisclosed multiple comparisons) — composes directly with the Regression Guard Hooks entry in Agent Capabilities. **Notebook integration**: `.ipynb` files are first-class experiment artifacts. The agent can execute cells via a Jupyter kernel wrapper tool, capture outputs + figures as proper manifest artifacts, and keep the notebook and any refactored `.py` module in sync (the *Background doc sync* entry generalized to code↔notebook). 
**Composition with every earlier entry**: Literature synthesis feeds the literature graph and `find_related_work`; Doc-to-Test Loop verifies the *published paper's* claims against the implementation (catches the "what the paper says vs what the code actually does" drift, which is a common research-integrity hazard); Integrated LaTeX Preview renders the draft with live figures pulled from `figures/<fig-id>/`; Visualization Dashboards renders the hypothesis graph, experiment timeline, and figure gallery inline; Browser-Agent Visual Verification sanity-checks each generated figure before it's committed to a draft; Fork & Parallel Solve lets the researcher explore two methodologies in parallel with side-by-side result comparison (the FFT vs wavelet scenario is literally an experiment-fork); Facets give per-domain personas (`statistician` for `validate_statistics`, `peer_reviewer` for `review-as`, `technical_writer` for `draft_section`); Project Knowledge Index indexes the research project so the agent retrieves across *past experiments* when suggesting new ones; Semantic Time Travel answers "three months ago we thought X about this hypothesis; what experiments changed our mind?"; Regression Guards enforce statistical validity; Shadow Workspaces host experiment runs so the main tree never ships with intermediate scratch files; Audit Mode is appropriate for write-heavy drafting sessions. **UI surfaces** a *Research* root in the SideCar sidebar with four sub-panels: *Projects* (list + active project selector), *Experiments* (timeline view, status badges, quick-reproduce button), *Hypothesis Graph* (interactive force-directed view), and *Drafts* (section-per-tab editor with citation previews on hover). A persistent status-bar item shows `Research: · 3 exp running · H2 needs evidence` so the user sees project state at a glance.
Configured via `sidecar.research.enabled` (default `false` — opt-in), `sidecar.research.projectsPath` (default `.sidecar/research/`), `sidecar.research.activeProject` (default auto-detects from CWD or most-recently-touched), `sidecar.research.reproduceStrictMode` (default `true` — fail on git-SHA / requirements-hash drift during `/experiment reproduce`; set `false` for "best-effort reproduce" in exploratory work), `sidecar.research.statisticsGuardEnabled` (default `true` — block `draft_section` on statistical-validity failures), and `sidecar.research.reviewerPersonas` (default `['skeptical-reviewer', 'domain-expert-reviewer', 'methods-critic-reviewer']` — extendable with custom persona skill IDs).
flowchart TD subgraph Project [".sidecar/research/<slug>/ (tracked in git)"] M[project.yaml<br/>title, question, hypotheses] E[experiments/<id>/manifest.yaml<br/>+ artifacts + env + seed] L[literature/<br/>Zotero overlays + notes] F[figures/<id>/<br/>data + script + rendered] D[drafts/<br/>paper, poster, slides] O[observations/<ts>.md<br/>timestamped notes] end H[Hypothesis Graph] --> E H --> D E --> F E --> D AG[Agent research tools] --> RUN[run_experiment] AG --> LO[log_observation] AG --> TH[test_hypothesis] AG --> FR[find_related_work] AG --> SU[suggest_next_experiment] AG --> VS[validate_statistics] AG --> GF[generate_figure] AG --> DS[draft_section] AG --> RV[review-as persona] RUN --> E VS -.Regression Guard.-> DS DS --> D GF --> F FR --> L RV --> D U[User] --> UI[Research sidebar:<br/>Projects · Experiments ·<br/>Hypothesis Graph · Drafts] UI --> AG -
First-Class Jupyter Notebook Support: closes a gap where current support is zero, because SideCar has no notebook awareness at all. `read_file` on an `.ipynb` returns raw JSON (unreadable to the model, useless for reasoning); `edit_file` risks corrupting the JSON schema because the agent can't see cell boundaries; VS Code's native `vscode.NotebookEdit` / `NotebookData` / `NotebookController` APIs are unused; and there's no way to run a cell and read its output, which is the whole point of notebooks for the scientific, data, and research workflows the Research Assistant entry above depends on. This entry adds a complete, cell-aware notebook surface built on the native VS Code APIs. **Eight new agent tools** replace naive text handling of `.ipynb` files, each dispatching through the native notebook APIs so the underlying JSON schema stays intact and the user's notebook editor reflects agent edits in real time, just as it does for human edits:
(1) `read_notebook(path, { includeOutputs?, maxOutputChars? })` returns structured `{ cells: [{ index, kind: 'code' | 'markdown' | 'raw', language, source, outputs?: NotebookOutput[], metadata }] }`. Outputs are optional because they balloon context (a single matplotlib plot is ~50k base64 chars), and when included they're truncated to `maxOutputChars` per cell with a `truncated: true` flag.
(2) `edit_notebook_cell(path, cellIndex, newSource)` surgically replaces one cell's source without touching surrounding cells, outputs, or metadata, routed through `vscode.NotebookEdit.updateCellText`.
(3) `insert_notebook_cell(path, atIndex, source, kind, language?)` creates a new cell at a specific position via `NotebookEdit.insertCells`.
(4) `delete_notebook_cell(path, cellIndex)` removes a cell cleanly via `NotebookEdit.deleteCells`.
(5) `reorder_notebook_cells(path, [newOrder])` rearranges cells (useful when refactoring exploration notebooks into linear presentation order).
(6) `run_notebook_cell(path, cellIndex, { timeoutMs? })` executes a cell via the notebook's attached `NotebookController` and returns structured outputs: text, tables, base64 images (auto-piped to Visual Verification when that feature is enabled and the cell produces a plot), stderr, execution count, elapsed time, and a `kernelError?` field with stack trace when execution fails.
(7) `run_notebook_all(path, { stopOnError?, maxCellMs? })` executes every code cell in order, streaming progress back to the agent as each completes so long-running notebooks don't block on a single response.
(8) `generate_notebook(path, { outline, template?, kernel? })` creates a new `.ipynb` from scratch with scaffolded cells. Built-in templates ship for common shapes (`data-exploration`, `signal-processing-analysis`, `paper-figure-reproduction`, `experiment-sweep`, `tutorial-walkthrough`), and the outline can be a free-form list of cell descriptions the model fills in.
**Roundtrip fidelity** is a hard invariant: reading a notebook → making an edit → writing it back preserves cell IDs, execution counts, cell metadata, kernel specs, language info, and (when the user didn't ask for output changes) every existing output byte-for-byte. This is enforced with a unit-level property test: a fuzzing harness that reads → no-op edits → writes 500 realistic notebooks and asserts byte equality, which catches the classic AI-assistant-corrupts-my-notebook failure mode before it ships. **Cell-aware streaming diff previews** extend the existing `streamingDiffPreviewFn` so a multi-cell edit shows each cell's diff in its own collapsible tile in the Pending Changes panel, not a single monolithic JSON-level diff (which is what the current raw-file path produces and which is useless for reviewing). Inserts / deletes / reorders get their own visual treatment so the user sees structural changes distinctly from content changes.
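The per-cell output cap described for `read_notebook` can be sketched as below. This is a minimal illustration, not the shipped implementation: the `CellOutput` and `AgentVisibleOutput` shapes and the `truncateCellOutputs` name are hypothetical, and sharing one character budget across all outputs of a cell is an assumption about how "per cell" would be interpreted.

```typescript
// Hypothetical shapes for illustration; the real tool's types may differ.
interface CellOutput {
  mime: string; // e.g. "text/plain" or "image/png"
  data: string; // text, or base64 for images
}

interface AgentVisibleOutput extends CellOutput {
  truncated: boolean;
}

// Cap the characters returned to the agent for ONE cell.
// Assumption: the budget is shared across the cell's outputs, in order,
// so a single large plot cannot exceed the cap on its own.
function truncateCellOutputs(
  outputs: CellOutput[],
  maxOutputChars: number
): AgentVisibleOutput[] {
  let remaining = Math.max(0, maxOutputChars);
  return outputs.map((out) => {
    if (out.data.length <= remaining) {
      remaining -= out.data.length;
      return { ...out, truncated: false };
    }
    // Keep whatever still fits and flag the output so the agent knows
    // it is looking at a partial view.
    const kept = out.data.slice(0, remaining);
    remaining = 0;
    return { ...out, data: kept, truncated: true };
  });
}
```

Only the agent-facing view is shortened here; nothing in the sketch writes back to the notebook, which matches the rule that truncation never touches durable state.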
**Kernel handling**: the agent respects the notebook's attached kernel. If the user already selected "Python 3.11 (venv)", agent tool calls execute there; if no kernel is attached, a one-time prompt goes through the existing approval system ("no kernel attached, select one or install the recommended `ipykernel` in `.venv`?"). Multi-language notebooks (Jupyter supports them) work: each cell's declared language drives which kernel subprocess handles it. Execution outputs cap at `sidecar.notebooks.maxOutputChars` (default `2000`) per cell for the returned-to-agent view; the full output always persists in the notebook file regardless, since truncation is for the agent's working context, not for durable state. **Output-to-Visual-Verification bridge**: when `run_notebook_cell` produces a base64 image output and `sidecar.visualVerify.enabled` is true, the image auto-flows into the Visual Verification pipeline (cheap checks for blank/clipped/axes-missing, optional VLM for criterion-matching) without the agent having to manually invoke `analyze_screenshot`, so a matplotlib plot in a research notebook gets the same vision-guided correctness loop that the Browser-Agent entry describes for web preview. **Merge-conflict handling**: `.ipynb` merges are notoriously bad in git because the JSON format serializes outputs, execution counts, and cell IDs into the diff. This entry doesn't solve git-level merging (out of scope) but does make SideCar's own conflict view cell-aware: when the Audit Mode treeview or Pending Changes panel detects a buffered notebook write colliding with an on-disk change, the three-way merge editor opens at the cell granularity rather than the JSON-line granularity.
**Integration with every earlier entry**: Research Assistant treats `.ipynb` as a first-class experiment artifact (`run_notebook_all` on an experiment manifest's notebook is the canonical reproduce path); Browser-Agent Visual Verification auto-hooks cell plot outputs; Regression Guards can register `trigger: post-write` with `command: jupyter nbconvert --execute --to notebook --inplace` to enforce that every notebook edit keeps the notebook runnable; Doc-to-Test Loop can synthesize `.ipynb` tests from paper figures (generated cells that reproduce each figure get faithfulness-checked); Fork & Parallel Solve lets each fork contain its own notebook variant for side-by-side methodology comparison; Merkle Index chunks notebooks at the cell level (each cell is its own Merkle leaf, so a one-cell edit re-hashes one leaf, not the whole notebook); Project Knowledge Index's symbol extractor recognizes notebook cells as first-class chunks alongside TS/Python functions; Shadow Workspaces run notebooks in the shadow kernel so the main tree's cached outputs aren't perturbed during iteration; Audit Mode's treeview shows per-cell diffs for buffered notebook writes. **Built-in code↔notebook sync** (the feature mentioned in Research Assistant): when a `.py` module and a sibling `.ipynb` both declare a symbol (function, class), the agent keeps them in step. Edits to the `.py` module prompt the agent to update the corresponding `.ipynb` cell and vice versa, with conflicts surfaced as a three-way merge. Configured via `sidecar.codeNotebookSync.pairs` (array of `{ module, notebook }` path pairs); absent = no-op.
Configured via `sidecar.notebooks.enabled` (default `true` once a notebook is opened or created in the workspace), `sidecar.notebooks.includeOutputsInRead` (default `false`: outputs bloat context; agent asks explicitly when needed), `sidecar.notebooks.maxOutputChars` (default `2000`), `sidecar.notebooks.autoExecuteOnEdit` (default `false`: agent edits don't auto-run cells; explicit `/run` or `run_notebook_cell` is required), `sidecar.notebooks.visualizeOutputsInVLM` (default `true` when Visual Verification is enabled), `sidecar.notebooks.cellGranularDiff` (default `true` for the cell-tile view; `false` falls back to raw JSON diff for debugging), and `sidecar.notebooks.templates` (array of template paths for `generate_notebook` beyond the built-ins).
flowchart TD A[Agent] --> T{Notebook tool} T --> RN[read_notebook<br/>structured cells +<br/>optional outputs] T --> EN[edit_notebook_cell<br/>via NotebookEdit.updateCellText] T --> IN[insert_notebook_cell<br/>via NotebookEdit.insertCells] T --> DN[delete_notebook_cell<br/>via NotebookEdit.deleteCells] T --> RC[run_notebook_cell<br/>via NotebookController.executeHandler] T --> RA[run_notebook_all<br/>streaming per-cell progress] T --> GN[generate_notebook<br/>templates + outline] EN & IN & DN --> WE[workspace.applyEdit<br/>WorkspaceEdit with<br/>NotebookEdit entries] WE --> IPY[.ipynb on disk] WE --> CELL_DIFF[Cell-granular diff<br/>in Pending Changes] RC --> OUT{Output kind} OUT -->|text / table| TXT[Back to agent,<br/>truncated to maxOutputChars] OUT -->|image base64| VV{visualVerify<br/>enabled?} VV -->|yes| VVP[auto-flow into<br/>Visual Verification pipeline] VV -->|no| TXT OUT -->|kernelError| ERR[Structured error +<br/>stack trace to agent] GN --> TPL[Built-in templates:<br/>data-exploration /<br/>signal-processing /<br/>paper-figure-repro /<br/>experiment-sweep] subgraph Invariants FID[Roundtrip fidelity:<br/>read → no-op edit → write<br/>= byte-equal<br/>property-tested] end
Multi-Agent
- Worktree-isolated agents — each agent in its own git worktree
- Agent dashboard — visual panel for running/completed agents
- Multi-agent task coordination — parallel agents with dependency layer
- Remote headless hand-off: detach tasks to run on a remote server via the `@sidecar/headless` CLI
- Multi-agent War Room: a red-team review layer that runs before output ever reaches the user. A lead Critic Agent adversarially challenges the coding agent's solution (logic, security, edge cases, architecture), the coding agent rebuts and revises, and the exchange continues for a configurable number of rounds until the critic is satisfied or escalates to the user. The full debate is streamed live in a dedicated War Room sidebar panel so you can watch the agents argue in real time. Builds on the existing `runCriticChecks` / `HookBus` infrastructure: the critic becomes a first-class peer agent rather than a post-turn annotation pass. Configurable via `sidecar.warRoom.enabled`, `sidecar.warRoom.rounds` (default: 2), and `sidecar.warRoom.model` (can point to a different, cheaper model for the critic role).
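The critic/coder exchange described for the War Room can be sketched as a bounded loop. This is an illustrative sketch only: `Critique`, `Critic`, `Coder`, `DebateResult`, and `runWarRoom` are hypothetical names, the two agents are modelled as plain synchronous functions (real model calls would be asynchronous), and counting one "round" as one critique followed by one revision is an assumption about how `sidecar.warRoom.rounds` would be interpreted.

```typescript
// Hypothetical types; the real critic/coder are model-backed agents.
interface Critique {
  satisfied: boolean;
  objections: string[];
}
type Critic = (solution: string) => Critique;
type Coder = (solution: string, objections: string[]) => string;

interface DebateResult {
  solution: string;
  revisions: number; // how many times the coder revised
  outcome: "accepted" | "escalated";
  transcript: string[]; // what a War Room panel could stream
}

function runWarRoom(
  initial: string,
  critic: Critic,
  coder: Coder,
  rounds = 2 // mirrors the documented default of sidecar.warRoom.rounds
): DebateResult {
  let solution = initial;
  const transcript: string[] = [];
  for (let revisions = 0; ; revisions++) {
    const critique = critic(solution);
    transcript.push(`critic: ${critique.objections.join("; ") || "no objections"}`);
    if (critique.satisfied) {
      return { solution, revisions, outcome: "accepted", transcript };
    }
    if (revisions === rounds) {
      // Revision budget spent and the critic still objects: hand over to the user.
      return { solution, revisions, outcome: "escalated", transcript };
    }
    solution = coder(solution, critique.objections);
    transcript.push(`coder: revision ${revisions + 1}`);
  }
}
```

The critic gets one final look after the last revision, so a fix made in the last allowed round can still be accepted rather than escalated.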
User Experience
-
Integrated LaTeX Preview & Compilation: a first-class technical writing workflow built on top of the agent tool system. The agent gains a `write_latex` tool that creates and edits `.tex` files with full awareness of document structure (preamble, environments, bibliography). A background compilation watcher runs `latexmk` (or `tectonic` as a zero-config fallback) on every save, parses the log for errors and undefined citations, and surfaces them as inline diagnostics in the editor. A Ghost Preview panel opens beside the source and renders the compiled PDF (or a KaTeX/MathJax live render of the current math block when a full compile is pending), giving a true side-by-side experience without leaving VS Code. Bibliography integrity is checked separately: missing `\cite{}` keys and malformed `.bib` entries are flagged before the compile even runs. Configurable via `sidecar.latex.compiler` (`latexmk` | `tectonic`), `sidecar.latex.ghostPreview.enabled`, and `sidecar.latex.bibCheck.enabled`.
- Background doc sync: silently update README/JSDoc/Swagger when function signatures change (2/3 shipped: JSDoc staleness diagnostics flag orphan/missing `@param` tags with quick fixes; README sync flags stale call arity in fenced code blocks with rewrite quick fixes. Swagger deferred: framework-specific, no in-repo OpenAPI spec to dogfood against; will revisit when a real use case lands.)
- Zen mode context filtering: `/focus <module>` to restrict context to one directory
-
Suggestion Mode: inverted-default approvals (flow-preserving UX). A fundamental reframing of the tool-dispatch UX from "we'll run it unless you stop us" to "here's what I'd do, click to apply." Today approvals in `cautious` mode (default) interrupt the developer's flow: destructive tools pop a native modal (`chatState.ts:242-250`) and non-destructive ones render an inline confirm card (`chatState.ts:255-259`) the user must dismiss before the agent proceeds. Even inline cards are blocking from the agent's POV, because `confirmFn` awaits the promise before `executeTool` returns. Both surfaces assume a binary accept/reject and force a context switch from writing-code-alongside-the-agent to reviewing-an-interrupt. The entire `toolPermissions: 'allow' | 'deny' | 'ask'` axis (`executor.ts:252-256`) is static: there's no "remember my choice for this session" affordance and no way to convert the interrupt into a non-blocking preview.
**The flip**: a new approval style `sidecar.approvals.style: 'modal' | 'inline' | 'suggestion'` (default stays `inline` to preserve existing behavior; users opt into `suggestion` when ready). In `suggestion` mode, a would-be tool call doesn't pause the agent. It materializes as a preview card in the chat transcript with the full payload visible (diff for `write_file` / `edit_file`, command text for `run_command`, search query for `grep`, etc.) and a one-click Apply / Skip / Edit & apply affordance. The agent's call returns synthetically as `suggested` rather than `executed`, so the loop keeps moving: the next iteration sees a tool result like `"Suggested write_file:src/auth.ts — user has not applied yet"` and reasons accordingly (it might ask the user in text, move on to independent work, or queue a dependent call that flips to `pending-apply` until the user acts). Nothing blocks; the developer scrolls through suggestions at their own pace, applying in order or out of order. This inverts the trust model: instead of the user being the brake on an agent sprinting forward, the user is the throttle gating each action in, closer to how Copilot Edits, Cursor's Agent mode, and Continue.dev's accept-per-hunk flow treat high-autonomy edits.
**Why this solves the specific pain**: the current UX problem isn't the existence of approvals (security and trust depend on them) but the shape of the interrupt. A 20-file refactor currently fires 20 inline cards, each blocking until dismissed; the developer can't keep writing code in another file while waiting because the agent is paused too. In `suggestion` mode, all 20 fire as non-blocking cards, the agent continues reasoning (producing downstream suggestions that depend on earlier ones as `pending-apply`), and the developer drains the queue at their own cadence, or applies all at once from a panel summary. Multi-File Edit Streams (v0.65) already plans edits as a DAG; `suggestion` mode naturally pairs with that, showing the Planned Edits card with per-edit Apply buttons instead of running writes behind the user's back.
**Mechanism and infrastructure changes required**:
- New `SuggestionStore`: a process-wide singleton holding `SuggestedAction { id, tool, input, rationale, createdAt, status: 'pending' | 'applied' | 'skipped' | 'edited', dependsOnIds: string[] }`. The executor's approval gate (`executor.ts:303-401`) branches on `config.approvals.style === 'suggestion'`: instead of calling `confirmFn`, it pushes a `SuggestedAction` into the store and returns a synthetic `ToolResultContentBlock` with `is_error: false` and a structured payload the agent can reason over (`{ status: 'suggested', suggestionId, summary }`).
- Webview protocol extension: new outgoing commands `suggestionCreated`, `suggestionApplied`, `suggestionSkipped`, `suggestionEdited`; new incoming commands `applySuggestion`, `skipSuggestion`, `editSuggestion`. Carries the full tool input so the preview can render syntax-highlighted content, a unified diff (for file writes via the existing `streamingDiffPreview` renderer), or a command transcript (for `run_command`).
- Chat UI tile per suggestion: styled like the Planned Edits card (v0.65 chunk 4.4a) with theme-token badges per tool type, a path / command summary line, expandable full-payload details, and three buttons: Apply (executes via `executeOneToolUse` with the original context), Skip (marks `status: 'skipped'`, surfaces as a "not applied" tool_result on the next turn so the agent knows), Edit & apply (opens the tool input in a modal editor to tweak the shell command, adjust file content, or rewrite the grep pattern, then applies the modified version; applied suggestions carry an `edited: true` flag the agent sees). Inline keyboard shortcuts: `⌘⏎` applies, `Escape` skips, `e` edits.
- Dependency tracking: when a suggestion's `input` references a path another pending suggestion would create or modify, we mark `dependsOnIds`. The UI badges dependent suggestions as `awaiting-parent` and greys the Apply button until prerequisites land, preventing the "apply a suggestion that edits a file that doesn't exist yet" footgun.
- Bulk actions on the summary panel: a persistent Pending Suggestions (N) strip above the chat input (reusing the steer-queue-strip layout from v0.65 chunk 3.3): Apply all (topologically), Skip all, Apply file-writes only (for when you trust edits but want to review shell commands individually). Each bulk action confirms once with a modal rather than firing N modals.
- Session-scoped "auto-apply" affordance: a checkbox on each suggestion ("Auto-apply future `write_file` on `src/auth/**`") converts that pattern into a session-scoped allowlist so repeated identical suggestions on the same surface auto-apply. Decays at session end (not a persistent setting, which avoids the opposite failure mode of a global quiet-mode switch where users forget it's on). Backed by a new `SessionAllowlist` interface on `ChatState` that the approval gate consults before creating a suggestion.
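The "Apply all (topologically)" bulk action and the `awaiting-parent` gating both come down to ordering pending suggestions by `dependsOnIds`. The sketch below shows one way that ordering could work. It uses a trimmed-down `SuggestedAction` (only the fields the ordering needs), and `planApplyAll` and `ApplyPlan` are hypothetical names. Treating `edited` as "applied after editing" is an assumption based on the status list above.

```typescript
type SuggestionStatus = "pending" | "applied" | "skipped" | "edited";

// Trimmed to the fields the ordering needs; the spec's record also carries
// input, rationale, and createdAt.
interface SuggestedAction {
  id: string;
  tool: string;
  status: SuggestionStatus;
  dependsOnIds: string[];
}

interface ApplyPlan {
  order: string[];   // safe application order for pending suggestions
  blocked: string[]; // pending suggestions whose prerequisites can never land
}

function planApplyAll(actions: SuggestedAction[]): ApplyPlan {
  // Assumption: "edited" means the suggestion was applied after editing.
  const satisfied = new Set<string>(
    actions
      .filter((a) => a.status === "applied" || a.status === "edited")
      .map((a) => a.id)
  );
  let waiting = actions.filter((a) => a.status === "pending");
  const order: string[] = [];

  // Repeatedly release every suggestion whose prerequisites are all satisfied.
  // Stops when a full pass makes no progress (skipped parents or a cycle).
  let progressed = true;
  while (progressed) {
    progressed = false;
    const stillWaiting: SuggestedAction[] = [];
    for (const action of waiting) {
      if (action.dependsOnIds.every((dep) => satisfied.has(dep))) {
        satisfied.add(action.id);
        order.push(action.id);
        progressed = true;
      } else {
        stillWaiting.push(action);
      }
    }
    waiting = stillWaiting;
  }
  return { order, blocked: waiting.map((a) => a.id) };
}
```

Anything left in `blocked` corresponds to what the UI would badge as `awaiting-parent`: its prerequisite was skipped, or it sits on a dependency cycle.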
**What stays blocking**: `suggestion` mode can be opted out of per tool via `sidecar.approvals.alwaysConfirm: string[]` (default `['run_command', 'git_push', 'delete_file']`). Truly destructive ops still fire the existing native-modal path because the cost of an "oops I clicked Apply by accident" on `rm -rf` is not recoverable. The `NATIVE_MODAL_APPROVAL_TOOLS` list (`chatState.ts:242`) becomes the default for `alwaysConfirm` and users can tighten or loosen it per taste. Suggestion mode is for the common case of file edits + reads + searches, which is where the flow-breaking accumulates; the truly destructive gate stays in place.
**Integration with every earlier entry**: Multi-File Edit Streams (v0.65) renders its Planned Edits card's per-file entries as suggestions natively: each DAG node becomes a `SuggestedAction` and the existing dependency layering maps 1:1 to the suggestion store's `dependsOnIds`. Steer Queue (v0.65) remains the mid-run course-correct channel: a steer queued while suggestions are pending can say "skip the `src/legacy/**` ones" and the summary strip honors that. Shadow Workspaces stay compatible: applying a suggestion in suggestion mode routes through `executeOneToolUse`, which honors `cwdOverride`, so approved suggestions land in the shadow tree exactly as today's approved writes do. Audit Mode becomes redundant for `write_file` in suggestion mode (the SuggestionStore IS the buffer; the user reviews + applies directly) but stays relevant for `run_command` and other non-write tools. Regression Guards fire against the applied set, not the suggested set: if the user skips half, guards only see what landed. Fork & Parallel Solve shows each fork's suggestions in its own column of the Fork Review panel.
**Phased rollout**: phase 1 ships `style: 'suggestion'` behind an opt-in flag with the SuggestionStore, webview tiles, and basic Apply/Skip, with no editing, no dependency tracking, and no bulk actions. Phase 2 adds Edit & apply, `dependsOnIds`, and bulk actions. Phase 3 adds session-scoped auto-apply patterns and per-tool `alwaysConfirm` tuning. Default remains `inline` through all three phases; `suggestion` becomes the default only after a release of telemetry-backed validation that Apply/Skip/Edit rates match the "non-blocking wins" hypothesis (users apply >80% of file-write suggestions with <5% rework).
Configured via `sidecar.approvals.style` (`modal` | `inline` | `suggestion`, default `inline`), `sidecar.approvals.alwaysConfirm` (string[], default `['run_command', 'git_push', 'delete_file']`), `sidecar.approvals.autoApplyPatterns` (session-scoped, UI-driven, not persisted; shown here for discoverability), `sidecar.approvals.showDependencyEdges` (default `true`), and `sidecar.approvals.bulkConfirmThreshold` (default `5`: above this many suggestions, Apply all requires one confirm click rather than silently running).
- Dependency drift alerts — real-time feedback on bundle size, vulnerabilities, and duplicates when deps change
Observability
-
RAG-Native Eval Metrics (RAGAs) + Qualitative LLM-as-Judge (G-Eval): reopens the LLM-as-judge scoring deferral from v0.50 (documented at ROADMAP.md under Eval harness gaps: "deterministic predicates give crisper regression signal than a second-model scoring hop, so this was intentionally skipped… reopen if we start shipping features where correctness is fuzzy rather than binary"). The deferral holds up for the features that existed at v0.50: tool-trajectory assertions, file-state substring matches, and `mustContain` / `mustNotContain` predicates on final output were the right call. But the features added since and pending across this ROADMAP (Project Knowledge Index with graph-fusion retrieval, Merkle-addressed fingerprints, Fork & Parallel Solve with its Judge mode, Doc-to-Test constraint extraction, Browser-Agent Visual Verification, Thinking Visualization modes) all have correctness surfaces that are fuzzy (retrieval quality, answer faithfulness, reasoning coherence, visual-check calibration), and trying to keep these honest with only deterministic predicates leaves a regression blind spot. This entry extends the existing `tests/llm-eval/` harness with two complementary metric layers, kept additive: deterministic predicates still gate on `mustContain` and tool trajectories (cheap, reliable, first line of defense); fuzzy metrics layer on top as optional per-case expectations the CI also gates on.
**Layer 1: RAGAs metrics** for retrieval-augmented features (Project Knowledge Index, monorepo cross-repo search, Literature synthesis, Memory Guardrails): four core scorers implemented as JS-native LLM-as-judge calls, not a Python subprocess dependency on the `ragas` package. The metrics are simple enough to reimplement cleanly (each is a prompt + a parser), and the VS Code extension shouldn't drag Python into its deployment story.
(1) Faithfulness: does the generated answer only claim things supported by retrieved context? Judge decomposes the answer into atomic claims, then for each claim asks "is this entailed by the retrieved context?"; score = entailed_claims / total_claims. Catches hallucination where the agent invents facts not in retrieved docs.
(2) Answer Relevancy: does the answer actually address the user's question? Judge generates N alternative questions the answer would have correctly responded to, compares their embeddings to the original question's, and scores by mean cosine similarity. Catches off-topic drift.
(3) Context Precision: did retrieval rank relevant chunks higher than irrelevant ones? Judge rates each returned chunk as relevant / irrelevant to the ground-truth answer, then computes mean reciprocal rank weighted by relevance. Catches "the right file was in position 8 but position 1 was a red herring" regressions that a flat "was the right file retrieved?" metric misses.
(4) Context Recall: did retrieval find all the chunks needed for the ground-truth answer? Judge decomposes the ground truth into atomic claims, and for each asks "is there a retrieved chunk that supports this?"; score = supported_gt_claims / total_gt_claims. Catches missing-needle failures that Context Precision alone can't detect.
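Once the judge has labelled each claim or chunk, the arithmetic behind three of the four scorers is small. The sketch below takes those labels as boolean arrays and computes the scores. Faithfulness and context recall follow the ratios stated above. For context precision, the text only says "mean reciprocal rank weighted by relevance", so the formula here (reciprocal-rank weight earned by relevant chunks divided by the total reciprocal-rank weight available) is one possible reading, not a confirmed definition. All function names are hypothetical.

```typescript
// Assumption: an empty denominator scores 0 so a case with nothing to judge
// fails visibly instead of passing vacuously.
function ratio(numerator: number, denominator: number): number {
  return denominator === 0 ? 0 : numerator / denominator;
}

// Faithfulness = entailed_claims / total_claims.
// claimEntailed[i] is the judge's verdict for the i-th atomic claim of the answer.
function faithfulness(claimEntailed: boolean[]): number {
  return ratio(claimEntailed.filter(Boolean).length, claimEntailed.length);
}

// Context recall = supported_gt_claims / total_gt_claims.
// gtClaimSupported[i] says whether some retrieved chunk supports the i-th
// atomic claim of the ground truth.
function contextRecall(gtClaimSupported: boolean[]): number {
  return ratio(gtClaimSupported.filter(Boolean).length, gtClaimSupported.length);
}

// Context precision (ONE POSSIBLE READING of "reciprocal rank weighted by
// relevance"): each rank r contributes weight 1/r; the score is the weight
// earned by relevant chunks over the total weight on offer. Relevant chunks
// near the top therefore count for more than relevant chunks near the bottom.
function contextPrecision(chunkRelevantByRank: boolean[]): number {
  let earned = 0;
  let possible = 0;
  chunkRelevantByRank.forEach((relevant, index) => {
    const weight = 1 / (index + 1);
    possible += weight;
    if (relevant) earned += weight;
  });
  return ratio(earned, possible);
}
```

With this reading, a ranking of `[relevant, irrelevant]` scores higher than `[irrelevant, relevant]` even though both retrieved the same chunk, which is the position-sensitivity the Context Precision description asks for.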
Cases declare these via a new `rag` expectations block: `expect: { rag: { faithfulness: { min: 0.85 }, contextPrecision: { min: 0.7 }, contextRecall: { min: 0.8 } } }`. **Layer 2: G-Eval qualitative scoring** for fuzzy output aspects (coherence, correctness on ambiguous tasks, style, custom criteria), implemented as a generic LLM-as-judge scorer with a common chain-of-thought template inspired by DeepEval's G-Eval, again re-implemented in TS rather than shelled out to the Python package. Each G-Eval scorer takes a name, a description of what's being measured, and a 1-N rating scale; the judge generates a CoT reasoning trace, then emits a numeric score with justification. Built-in criteria ship pre-tuned: coherence (does the response follow a logical structure?), correctness (given the task description, is the output free of errors?), relevance (does it address what was asked?), fluency (well-formed prose), actionability (can the user act on the answer without clarification?); custom criteria are user-declarable via `sidecar.eval.gEvalCriteria` with a name, description, and scale. Used by cases as `expect: { gEval: { coherence: { min: 7 }, correctness: { min: 8 } } }`. The judge's full reasoning is captured in the eval report so regressions come with why they're regressions, not just "score dropped 0.4 → 0.3." **Shared LLM-as-judge primitive** backs both layers at `tests/llm-eval/scorers/llmJudge.ts`: a single dispatch point that handles judge-model routing (via Model Routing rules' `judge` role so cheap-judge vs gold-judge is configurable), caches results aggressively to `.sidecar/cache/eval-judge/` keyed by `(judgeModel, promptHash, inputHash)` so re-running the suite against unchanged inputs is free, and supports cheap-judge-first / gold-judge-on-borderline for cost control: run Haiku on every case, escalate to Sonnet only when Haiku's score is near the pass threshold (within a configurable margin) so close calls get the better judge but clear passes/fails don't burn the budget.
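The cheap-judge-first escalation and the `(judgeModel, promptHash, inputHash)` cache key can be sketched together. Everything named here is hypothetical (`JudgeVerdict`, `Judge`, `withCache`, `judgeWithEscalation`); judges are modelled as synchronous functions to keep the sketch self-contained, whereas real model calls would be asynchronous; the in-memory `Map` stands in for the on-disk cache; and FNV-1a is used only as a dependency-free stand-in for whatever hash the real primitive would use.

```typescript
interface JudgeVerdict {
  score: number;
  reasoning: string;
  model: string;
}
type Judge = (prompt: string, input: string) => JudgeVerdict;

// Stand-in hash (32-bit FNV-1a) so the sketch has no dependencies.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Key shape mirrors (judgeModel, promptHash, inputHash).
function cacheKey(judgeModel: string, prompt: string, input: string): string {
  return `${judgeModel}:${fnv1a(prompt)}:${fnv1a(input)}`;
}

// Wrap a judge so unchanged (model, prompt, input) triples are never re-scored.
function withCache(
  judgeModel: string,
  judge: Judge,
  cache: Map<string, JudgeVerdict>
): Judge {
  return (prompt, input) => {
    const key = cacheKey(judgeModel, prompt, input);
    const hit = cache.get(key);
    if (hit) return hit;
    const verdict = judge(prompt, input);
    cache.set(key, verdict);
    return verdict;
  };
}

// Run the cheap judge first; consult the gold judge only when the cheap score
// lands within `margin` of the pass threshold. With no gold judge configured,
// the cheap verdict always stands.
function judgeWithEscalation(
  prompt: string,
  input: string,
  threshold: number,
  margin: number,
  cheap: Judge,
  gold?: Judge
): JudgeVerdict {
  const first = cheap(prompt, input);
  const borderline = Math.abs(first.score - threshold) <= margin;
  return borderline && gold ? gold(prompt, input) : first;
}
```

Clear passes and clear fails cost one cheap call (or zero on a cache hit); only close calls pay for the second opinion.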
**Ground-truth curation workflow**: RAGAs recall requires ground-truth answers, which the current harness doesn't collect. A new `tests/llm-eval/ground-truth/` directory stores per-case ground truths as markdown + YAML frontmatter (`{ answer: "...", supportingFacts: [...], requiredContext: [...] }`); a `/curate-ground-truth` CLI walks uncurated cases, generates draft ground truths via the judge model, and surfaces them in a review UI where the human edits and commits. The workflow is explicit about provenance: ground truths carry a `curator: human | model | model-reviewed` tag in frontmatter so eval reports can flag metrics computed against unreviewed model-generated truths as tentative rather than authoritative. **Regression tracking surface**: eval report output extends the existing text summary with per-metric trend data (`faithfulness: 0.87 (↓ 0.03 from prev)`) and a CI-friendly `tests/llm-eval/history.jsonl` append-only log of each run's metrics keyed by git SHA, so `npm run eval:report` can render a 30-day chart showing whether retrieval precision is drifting as the Merkle index changes, faithfulness is regressing as prompts evolve, or coherence is degrading on cheaper-model runs. **Cost controls**: `sidecar.eval.judgeBudgetPerRun` (default `$1.00` USD equivalent; a full RAG+G-Eval suite with Haiku-judge typically costs ~$0.10–0.30, so this is conservative); exceeding the budget skips the remaining fuzzy scorers with a visible warning rather than billing-surprising the user. Deterministic scorers always run because they're free.
**Composes with every earlier retrieval entry**: Project Knowledge Index acceptance criteria become concrete RAGAs thresholds (context precision must not regress after symbol-chunking migration); Merkle fingerprint stability becomes a test (same root → identical retrieval output → identical RAG scores, which is a stronger regression signal than per-feature tests); Fork & Parallel Solve's built-in Judge mode reuses the same `llmJudge` primitive so its in-runtime scoring is consistent with the offline eval scoring; Doc-to-Test Loop's synthesized tests get faithfulness-checked against the source doc; Visual Verification's VLM verdicts get a coherence check via G-Eval. Configured via `sidecar.eval.ragMetrics` (array of enabled RAGAs scorers, default `['faithfulness', 'answerRelevancy', 'contextPrecision', 'contextRecall']`), `sidecar.eval.gEvalCriteria` (record of name → `{ description, scale: [1, N] }` for custom criteria beyond the built-ins), `sidecar.eval.judgeBudgetPerRun` (default `1.00`), `sidecar.eval.cheapJudgeModel` (default inherits from Model Routing `judge` role), `sidecar.eval.goldJudgeModel` (default empty; gold escalation is disabled if unset), `sidecar.eval.goldJudgeMargin` (default `0.1`: escalate to gold when cheap-judge score is within this margin of the threshold), and `sidecar.eval.cacheDir` (default `.sidecar/cache/eval-judge/`, covered by the gitignored-subdirs carve-out).
flowchart TD CASE[Eval case with<br/>expect: mustContain +<br/>rag + gEval blocks] --> RUN[Run SideCar<br/>agent on input] RUN --> OUT[Final output +<br/>retrieved context +<br/>tool trajectory] OUT --> DET[Deterministic scorers<br/>mustContain, trajectory,<br/>file-state] OUT --> RAG{RAGAs scorers} OUT --> GEV{G-Eval scorers} RAG --> FA[Faithfulness:<br/>atomic claims vs context] RAG --> AR[Answer relevancy:<br/>generated questions ≈ input] RAG --> CP[Context precision:<br/>weighted MRR] RAG --> CR[Context recall vs<br/>ground truth] GEV --> COH[Coherence 1-10] GEV --> COR[Correctness 1-10] GEV --> CUSTOM[User criteria] FA & AR & CP & CR & COH & COR & CUSTOM --> JUDGE[LLM-as-judge<br/>cheap first → gold on borderline] JUDGE --> CACHE[(.sidecar/cache/eval-judge/<br/>judgeModel + promptHash)] DET & JUDGE --> AGG[Aggregate result] AGG --> HIST[Append to<br/>history.jsonl by SHA] HIST --> REPORT[Trend report<br/>per-metric deltas +<br/>judge reasoning traces]
- Model comparison / Arena mode: side-by-side prompt comparison with voting
- Role-Based Model Routing & Hot-Swap — replaces SideCar’s current scatter of per-role model settings (
sidecar.model,sidecar.completionModel,sidecar.critic.model,sidecar.delegateTask.workerModel,sidecar.fallbackModel, and theplannerModel/judgeModel/vlmknobs added in other roadmap entries) with a unified, declarative rule set that routes each dispatch to the right model for its actual job — so you can run Llama 3 for free local chat, promote to Claude Sonnet/Opus for the high-reasoning agent loop, and drop to Haiku for cheap summarization, all in one coherent config. The target experience: ultra-pro intelligence exactly where it earns its keep (the multi-turn agent loop, the War Room critic, the planner pass before a wide refactor) with the rest of the session staying free and local. Rule shape:"sidecar.modelRouting.rules": [ // First match wins — list most specific first. { "when": "agent-loop.complexity=high", "model": "claude-opus-4-6" }, { "when": "agent-loop", "model": "claude-sonnet-4-6" }, { "when": "chat", "model": "ollama/llama3:70b" }, { "when": "completion", "model": "ollama/qwen2.5-coder:7b" }, { "when": "summarize", "model": "claude-haiku-4-5" }, { "when": "critic", "model": "claude-haiku-4-5" }, { "when": "worker", "model": "ollama/qwen3-coder:30b" }, { "when": "planner", "model": "claude-haiku-4-5" }, { "when": "judge", "model": "ollama/qwen2.5-coder:7b" }, { "when": "visual", "model": "claude-sonnet-4-6" }, { "when": "embed", "model": "local/all-MiniLM-L6-v2" } ]Role taxonomy (every dispatch point in SideCar is tagged with one):
chat (one-off Q&A without tools), agent-loop (multi-turn tool-using work), completion (FIM autocomplete), summarize (ConversationSummarizer, prompt pruner, tool-result compressor), critic (War Room critic, completion-gate critic), worker (delegate_task local research worker), planner (edit-plan pass, fork approach planner), judge (fork judge, constraint-approval scoring), visual (screenshot VLM for browser-agent verification), embed (Project Knowledge Index vectors — this one is provider-specific and rarely overridden, but exposed for completeness).
Compound match expressions — rules can include signal filters after the role: agent-loop.complexity=high (turn count × tool fan-out × file span exceeds threshold), agent-loop.files~=src/physics/** (glob match on files the turn is touching), chat.prompt~=/pro\b|think hard/ (explicit user cue in the prompt), agent-loop.retryCount>=3 (escalate on recurring failure). Signals are computed cheaply before each dispatch and passed to the router along with the role.
Hot-swap is literal: within a single conversation, the active model changes at role boundaries — SideCarClient.updateModel() already exists, so the ModelRouter service just calls it with the rule-resolved choice before each dispatch. Message history is preserved across swaps (all backends speak compatible message shapes for the roles we swap into); tool definitions are unchanged; Anthropic prompt-cache breakpoints survive within a same-model run so the 90% cached-read discount doesn’t get reset by a cross-role swap to a different provider.
Cost visibility: a status-bar item shows the current active model with a tooltip breaking down this session’s spend by role (agent-loop: $0.42 (sonnet) · chat: $0.00 (local llama) · summarize: $0.03 (haiku)) so users see exactly where their money is going.
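To make the first-match-wins rule shape and the compound match expressions concrete, here is a small sketch of a matcher. It is an illustration under assumptions, not the ModelRouter itself: the `Signals` record, the `matches` and `resolveModel` helpers, and the simplified glob translation (`**` crosses directories, `*` does not) are invented for the example. Only the `when` syntax and the rule list come from the spec.

```typescript
// Hypothetical signal bag computed before each dispatch.
type Signals = Record<string, string | number | string[]>;

interface Rule {
  when: string;
  model: string;
}

// Simplified glob: "**" matches across "/", "*" stays within one segment.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000")
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${pattern}$`);
}

// Accepts "role" or "role.signal<op>value" with op one of "=", ">=", "~=".
function matches(when: string, role: string, signals: Signals): boolean {
  const m = /^([a-z-]+)(?:\.([A-Za-z]+)(>=|~=|=)(.+))?$/.exec(when);
  if (!m || m[1] !== role) return false;
  if (m[2] === undefined) return true; // bare role rule
  const actual = signals[m[2]];
  if (actual === undefined) return false;
  const op = m[3] ?? "";
  const expected = m[4] ?? "";
  if (op === ">=") return typeof actual === "number" && actual >= Number(expected);
  if (op === "=") return String(actual) === expected;
  // "~=": a /regex/ literal, otherwise a glob; any value may match.
  const isRegex = expected.length >= 2 && expected.startsWith("/") && expected.endsWith("/");
  const re = isRegex ? new RegExp(expected.slice(1, -1)) : globToRegExp(expected);
  const values = Array.isArray(actual) ? actual : [String(actual)];
  return values.some((v) => re.test(v));
}

// First match wins; fall back to the default model when nothing matches.
function resolveModel(rules: Rule[], role: string, signals: Signals, defaultModel: string): string {
  for (const rule of rules) {
    if (matches(rule.when, role, signals)) return rule.model;
  }
  return defaultModel;
}
```

Because evaluation stops at the first hit, the order in the rule list is the whole precedence model: `agent-loop.complexity=high` has to appear before the bare `agent-loop` rule or it would never fire.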
Budget-aware downgrade: each rule can declare a dailyBudget / sessionBudget / hourlyBudget and an optional fallbackModel; when the cap trips, the router silently downgrades (claude-opus-4-6 → claude-sonnet-4-6 → claude-haiku-4-5 → ollama/qwen3-coder:30b) and surfaces a single non-blocking toast.
One-off override via the /model <name> slash command for the rest of the session regardless of rules, plus @opus, @sonnet, @haiku, @local inline sentinels in the user message that bypass routing for just that turn.
Migration from existing per-role settings is automatic: on first activation with modelRouting.rules set, SideCar translates any non-default sidecar.completionModel / sidecar.critic.model / etc. into synthesized rules and writes them into the new config, keeping the old fields as no-ops for backward compat. Users without modelRouting.rules keep the current per-field behavior — zero migration cost for the simple case.
Composes with every earlier entry: Skills 2.0’s preferred-model frontmatter becomes a per-skill rule injected for the skill’s lifetime; Facets’ preferredModel becomes a per-facet rule; Fork & Parallel Solve can declare per-fork model rules (fourier on Sonnet, wavelet on Haiku for cost comparison); the GPU-Aware Load Balancing feature’s auto-downgrade on VRAM pressure becomes one of the router’s triggers rather than a parallel code path; Audit Mode can require confirmation when the router would escalate to a paid model without user awareness.
Ad-hoc complexity heuristic for agent-loop.complexity=high (tunable, good defaults): turn count >= 5 OR distinct-files-touched >= 3 OR consecutive-tool-use-blocks >= 8 OR user prompt contains explicit reasoning cues (prove, verify, reason through, think step by step). The heuristic is boring on purpose — anything smarter invites surprises about why a cheap session suddenly escalated.
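The complexity heuristic and the budget-aware downgrade chain above are simple enough to sketch directly. The `TurnStats` shape, the function names, and the per-model USD caps are assumptions for illustration: the spec attaches budgets to rules (daily / session / hourly) rather than to models, so treat the cap lookup here as a simplification. The thresholds, reasoning cues, and downgrade order are taken from the entry.

```typescript
// Hypothetical per-turn statistics gathered before dispatch.
interface TurnStats {
  turnCount: number;
  distinctFilesTouched: number;
  consecutiveToolUseBlocks: number;
  prompt: string;
}

// Whole-word cues, so "improve" does not trigger on "prove".
const REASONING_CUES = /\b(prove|verify|reason through|think step by step)\b/i;

// Any single condition is enough to mark the turn as high complexity.
function isHighComplexity(s: TurnStats): boolean {
  return (
    s.turnCount >= 5 ||
    s.distinctFilesTouched >= 3 ||
    s.consecutiveToolUseBlocks >= 8 ||
    REASONING_CUES.test(s.prompt)
  );
}

// Downgrade order quoted from the spec.
const DOWNGRADE_CHAIN = [
  "claude-opus-4-6",
  "claude-sonnet-4-6",
  "claude-haiku-4-5",
  "ollama/qwen3-coder:30b",
];

// Walk down the chain until a model is under its cap (or has no cap).
function pickWithinBudget(
  preferred: string,
  spentUsd: Map<string, number>,
  capUsd: Map<string, number>,
): string {
  const start = DOWNGRADE_CHAIN.indexOf(preferred);
  const candidates = start === -1 ? [preferred] : DOWNGRADE_CHAIN.slice(start);
  for (const model of candidates) {
    const cap = capUsd.get(model);
    if (cap === undefined || (spentUsd.get(model) ?? 0) < cap) return model;
  }
  // Everything is capped out: stay on the last (cheapest) candidate.
  return candidates[candidates.length - 1] ?? preferred;
}
```

Keeping both functions pure makes the router's decisions easy to log in dry-run mode: the same inputs always produce the same model choice.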
Configured via sidecar.modelRouting.enabled (default false — opt-in until users have calibrated rules), sidecar.modelRouting.rules (ordered rule list, first match wins), sidecar.modelRouting.defaultModel (fallback when no rule matches, defaults to sidecar.model), sidecar.modelRouting.visibleSwaps (default true — show a brief toast on model swap so the user knows what happened; false for silent operation once calibrated), and sidecar.modelRouting.dryRun (default false; when true, the router logs what it would have selected but sticks with sidecar.model, for safely calibrating rules before enabling them).
flowchart TD
    D[Dispatch point] --> ROLE[Tag role:<br/>chat / agent-loop /<br/>completion / summarize / ...]
    ROLE --> SIG[Compute signals:<br/>complexity, files, retries,<br/>prompt cues]
    SIG --> RULES{Match rules<br/>top-down}
    RULES -->|first match| BUDG{Budget ok?}
    BUDG -->|yes| SWAP[updateModel to rule's choice]
    BUDG -->|exhausted| FALL[Fallback model<br/>or chain to next rule]
    FALL --> BUDG
    SWAP --> DISP[Dispatch to backend]
    DISP --> TRACK[Track spend<br/>per role]
    TRACK --> STATUS[Status bar:<br/>active model + tooltip<br/>spend breakdown]
    RULES -->|no match| DEF[defaultModel]
    DEF --> DISP
- GPU-Aware Load Balancing — SideCar monitors VRAM pressure in real time (via
nvidia-smi, rocm-smi, or the Metal Performance HUD on Apple Silicon) and automatically backs off when a competing workload — such as a PyTorch/JAX training run — is detected consuming significant VRAM. Three escalating responses: (1) silent downgrade — swap to a smaller quantised variant of the current model (e.g. q8_0 → q4_K_M) if one is available locally; (2) user prompt — if no smaller local model is available, surface a non-blocking toast offering to switch to a cloud provider (Anthropic / OpenAI) for the duration of the heavy workload; (3) pause & queue — if the user dismisses the toast, queue pending agent turns and retry once VRAM headroom recovers. Restores the original model automatically when pressure drops below the threshold. Configurable via sidecar.gpuLoadBalancing.enabled, sidecar.gpuLoadBalancing.vramThresholdPercent (default: 80), sidecar.gpuLoadBalancing.fallbackModel, and sidecar.gpuLoadBalancing.cloudFallbackProvider.
- Real-time code profiling — MCP server wrapping language profilers
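For the GPU-Aware Load Balancing entry above, the three escalating responses reduce to a small decision function over a VRAM reading. The sketch below assumes the reading comes from `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits`, which prints one `used, total` line per GPU in MiB. The function names and the `GpuResponse` labels are hypothetical, and treating a reading exactly at the threshold as "under pressure" is an assumption the spec does not settle.

```typescript
// Hypothetical labels for the three responses in the spec, plus "none".
type GpuResponse = "none" | "silent-downgrade" | "offer-cloud" | "pause-and-queue";

// Parses "<usedMiB>, <totalMiB>" lines and returns the busiest GPU's usage in percent.
function parseVramPercent(csv: string): number {
  let worst = 0;
  for (const line of csv.trim().split("\n")) {
    const fields = line.split(",");
    const used = Number(fields[0]);
    const total = Number(fields[1]);
    if (Number.isFinite(used) && Number.isFinite(total) && total > 0) {
      worst = Math.max(worst, (used / total) * 100);
    }
  }
  return worst;
}

function chooseResponse(
  vramPercent: number,
  thresholdPercent: number, // sidecar.gpuLoadBalancing.vramThresholdPercent, default 80
  hasSmallerLocalVariant: boolean,
  userDismissedToast: boolean,
): GpuResponse {
  // Below the threshold there is nothing to do (and the original model can be restored).
  if (vramPercent < thresholdPercent) return "none";
  // (1) Prefer a smaller local quantisation when one exists.
  if (hasSmallerLocalVariant) return "silent-downgrade";
  // (2) Otherwise offer the cloud fallback; (3) queue turns if the user declined.
  return userDismissedToast ? "pause-and-queue" : "offer-cloud";
}
```

Taking the busiest GPU is one possible policy for multi-GPU machines; a real implementation might instead track only the device the local model is loaded on.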
Security & Permissions
- Granular permission controls — per-category tool permissions, upfront scope requests
- Enhanced sandboxing — constrained environments for dangerous tools
- Customizable code analysis rules — sidecar.analysisRules with regex patterns and severity
Providers & Integration
- Remote PR Review Automation — Fetch, Analyze, Post Line-Anchored Comments — extends the shipped local reviewCurrentChanges into a proper remote PR review loop. Today sidecar.reviewChanges runs on whatever’s in the local working tree; if the user wants to review someone else’s PR they have to git fetch && git checkout manually first. This entry adds SideCar: Review Pull Request <#> which takes a PR number (or owner/repo + number, or a full GitHub URL), fetches the PR’s unified diff via /repos/:owner/:repo/pulls/:number + /repos/:owner/:repo/pulls/:number/commits + /repos/:owner/:repo/pulls/:number/comments, runs the reviewer against the fetched diff plus the existing comment thread context (“the reviewer already flagged the auth regression in comment #47 — don’t re-flag it”), and posts line-anchored review comments back via POST /repos/:owner/:repo/pulls/:number/comments with the path + line + side + commit_id the GitHub API requires.
Structured reviewer output: the reviewer prompt is extended to emit JSON-tagged findings — { path, line, side: 'RIGHT' | 'LEFT', severity: 'block' | 'suggest' | 'nit', message, suggestedChange? } — so the poster can route block findings to a requested-changes review, suggest to regular comments, and nit to resolved discussions by default.
Dry-run by default: first run produces a preview webview listing every proposed comment; the user picks which to post. sidecar.pr.review.autoPost: true opts into posting directly (for CI bots / automation accounts).
Composes with Skills: the review-code skill that already ships becomes the default prompt for remote PR review; project-local review skills in <workspace>/.sidecar/skills/ override for domain-specific review rules (security-focused PRs, performance-sensitive modules).
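The structured reviewer output described above lends itself to a defensive parse followed by a pure mapping step. The sketch below is illustrative only: `parseFindings`, `toComments`, and `overallEvent` are hypothetical helper names, and the severity-to-event table assumes the entry's stated default for sidecar.pr.review.severityMapping (block → REQUEST_CHANGES, suggest → COMMENT, nit → COMMENT). The comment payload uses the `path`, `line`, `side`, and `commit_id` fields the entry names; rendering of `suggestedChange` is left out.

```typescript
type Severity = "block" | "suggest" | "nit";
type ReviewEvent = "REQUEST_CHANGES" | "COMMENT";

interface Finding {
  path: string;
  line: number;
  side: "RIGHT" | "LEFT";
  severity: Severity;
  message: string;
  suggestedChange?: string;
}

interface CommentPayload {
  body: string;
  commit_id: string;
  path: string;
  line: number;
  side: "RIGHT" | "LEFT";
}

const DEFAULT_MAPPING: Record<Severity, ReviewEvent> = {
  block: "REQUEST_CHANGES",
  suggest: "COMMENT",
  nit: "COMMENT",
};

// Model output is untrusted: drop anything that is not a well-formed finding.
function parseFindings(json: string): Finding[] {
  let raw: unknown;
  try {
    raw = JSON.parse(json);
  } catch {
    return [];
  }
  if (!Array.isArray(raw)) return [];
  return raw.filter(
    (f): f is Finding =>
      typeof f === "object" && f !== null &&
      typeof f.path === "string" &&
      Number.isInteger(f.line) && f.line > 0 &&
      (f.side === "RIGHT" || f.side === "LEFT") &&
      (f.severity === "block" || f.severity === "suggest" || f.severity === "nit") &&
      typeof f.message === "string",
  );
}

// One line-anchored comment per finding, tagged with its severity.
function toComments(findings: Finding[], commitId: string): CommentPayload[] {
  return findings.map((f) => ({
    body: `[${f.severity}] ${f.message}`,
    commit_id: commitId,
    path: f.path,
    line: f.line,
    side: f.side,
  }));
}

// A single blocking finding is enough to request changes on the whole review.
function overallEvent(findings: Finding[], mapping = DEFAULT_MAPPING): ReviewEvent {
  return findings.some((f) => mapping[f.severity] === "REQUEST_CHANGES")
    ? "REQUEST_CHANGES"
    : "COMMENT";
}
```

In a dry-run flow, the output of `toComments` is what a preview would list; nothing needs to reach the network until the user confirms.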
Composes with Facets: a batch of facets can each review the same PR — security-reviewer looks for auth/injection issues, test-author flags missing test coverage, general-coder catches logic bugs — and the aggregated-review UI from v0.66 merges their findings with per-facet tags so the user sees “security-reviewer flagged lines 42-48 for CSRF, test-author flagged lines 12-20 for missing test, general-coder had no issues.”
Configured via sidecar.pr.review.defaultSkill (default review-code), sidecar.pr.review.severityMapping (maps the three severity tiers to review event types — default block → REQUEST_CHANGES, suggest → COMMENT, nit → COMMENT), sidecar.pr.review.autoPost (default false), and sidecar.pr.review.includeExistingComments (default true — set false to do a clean review that ignores prior reviewer signal).
-
CI Failure Analysis & Fix — GitHub Actions Log Ingestion with Proposed Repair Commits — closes the gap between “CI failed on my PR” and “I know why and how to fix it.” Today SideCar’s Terminal Error Interception (shipped) catches failures in the integrated terminal; this entry extends the same flow to remote CI.
SideCar: Analyze Failed CI Run fetches the latest failed run for the current branch via /repos/:owner/:repo/actions/runs?branch=...&status=failure&per_page=1, downloads the failed job’s log via /repos/:owner/:repo/actions/jobs/:job_id/logs (with 4 MB cap; on overflow, uses tail semantics via a Range header), extracts the failing step’s log slice using the ##[endgroup] / ##[error] markers GitHub Actions emits, and feeds it through the same diagnose-in-chat synthesized-prompt path that terminal errors already use.
PR-aware mode: when the current branch has an open PR, the flow auto-detects it and offers “Propose a fix commit” — the agent diagnoses the failure, opens a new <branch>-fix-ci branch in a Shadow Workspace, makes the fix, runs local tests, and opens a draft follow-up PR or pushes directly onto the original branch (gated by user approval).
Log parsing: per-runner-type (Linux / macOS / Windows) regexes strip ANSI, collapse timestamp prefixes, detect test-runner output patterns (vitest / jest / pytest / go test / cargo test / rspec — use existing TestRunnerRegistry), and surface the test that failed + the assertion message rather than the raw 4 MB log.
Composes with Actions filter: a new sidecar.ci.analysis.jobFilter (glob array against job name) lets users scope to the jobs that matter — if CI has a lint job and a test job, analyzing the test failure first is usually right.
Configured via sidecar.ci.analysis.enabled (default true), sidecar.ci.analysis.maxLogBytes (default 4_000_000), sidecar.ci.analysis.jobFilter (default ["*"]), and sidecar.ci.analysis.autoProposeFix (default false — requires user confirmation before opening a fix branch).
-
Draft PR From Branch — One-Command Push + Generate + Open — a single command that replaces the three-step manual dance most users do today (git push -u origin HEAD + craft title/body + gh pr create). SideCar: Create Pull Request runs git push -u origin HEAD (with a pre-flight branch-protection check — see below), invokes the existing local summarizePR path against the commit range since the base branch’s divergence point (git merge-base) to produce a title + body, opens a preview for the user to edit, then calls POST /repos/:owner/:repo/pulls.
Draft by default: PRs are opened as drafts (draft: true) so they don’t spam reviewer queues before the author’s had a last look; a one-click Ready for review follows the existing github tool pattern.
Template awareness: when .github/pull_request_template.md or .github/PULL_REQUEST_TEMPLATE.md exists, it’s loaded and its sections are filled in section-by-section by the model (not overwritten wholesale — preserves H2 headings the template declares).
Configured via sidecar.pr.create.draftByDefault (default true), sidecar.pr.create.baseBranch (default auto-detected from HEAD’s upstream-tracking or origin/HEAD), and sidecar.pr.create.template (auto, ignore, or an absolute path; default auto).
-
Branch Protection Awareness — Pre-Push Status-Check + Required-Reviewer Warnings — prevents the common “pushed straight to main, failed CI, got chased by the team” footgun. Before any git push / git_push tool call against a branch, SideCar queries /repos/:owner/:repo/branches/:branch/protection (authenticated) and /repos/:owner/:repo/commits/:sha/check-runs to find required status checks + required reviewer counts. If the branch is protected AND the push target doesn’t satisfy the required checks OR lacks the required approvals, a modal surfaces the gaps (“main requires status checks ci/lint and ci/test; only ci/lint has passed on this commit. Required reviewer count is 2; you have 0 approving reviews.”) with Proceed / Cancel. The warning is skipped for unprotected branches and for the user’s own feature branches.
Composes with Draft PR: the Create Pull Request flow runs this check against the base branch at submit time and warns that the PR can’t merge until checks/reviewers are satisfied — sets expectations before the author waits on CI.
Configured via sidecar.pr.branchProtection.enabled (default true), sidecar.pr.branchProtection.warnEvenIfPassing (default false — turns on a soft reminder even when checks pass so the user sees what’s required).
-
Process Lifecycle Hardening — ManagedChildProcess + Registry + Orphan Sweep — closes the real-world failure mode where VS Code window reload or abrupt IDE close strands child processes spawned by SideCar (MCP stdio servers, the ShellSession persistent shell, custom-tool wrappers, future background workers).
Current state is better than many extensions — MCPManager, ShellSession, EventHookManager, ToolRuntime, and Scheduler all implement dispose() and are pushed into context.subscriptions so the VS Code lifecycle drives teardown — but three gaps bite under real conditions. (1) MCPManager.disconnect() awaits the SDK’s client.close() with no timeout (mcpManager.ts:420-424); a stdio server whose stdin handler blocks means close() hangs forever, VS Code’s own deactivate timeout force-kills the extension host, and the child process gets reparented to init (Linux) or abandoned (macOS). (2) Activation assumes a clean slate — there is no detection of “I rebooted because VS Code crashed, there’s a stale mcp-server python process still bound to port 9000.” (3) HTTP/SSE MCPs that bind local ports leave the port in TIME_WAIT or held by the orphan; new sessions fail to connect with a confusing error. This entry introduces a unified lifecycle primitive across every spawn site.
ManagedChildProcess wrapper at src/agent/processLifecycle.ts standardizes every spawn: enforces detached: false so SIGTERM propagates on parent death, pipes stdio (never inherits) so descriptors close cleanly, registers PID + spawn signature into a ProcessRegistry on start, emits typed lifecycle events (spawned / closed / killed / timeout) observable from tests, and provides one canonical close chain: graceful close (await provided cleanup fn) → 2s timeout → SIGTERM → 1s → SIGKILL. The chain is deterministic — worst-case 3s per child, parallelizable across N children, so dispose() on the extension has a bounded cost VS Code can honor.
ProcessRegistry singleton pushed into context.subscriptions at the top of activation; every spawn site (MCPStdioClientTransport, ShellSession, AgentTerminalExecutor where applicable, custom-tool wrappers, future HTTP-bound servers) routes through the registry rather than calling child_process.spawn directly.
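The canonical close chain (graceful close → 2s → SIGTERM → 1s → SIGKILL) can be sketched against a minimal child abstraction. This is an assumption-laden illustration rather than the ManagedChildProcess implementation: `ChildHandle`, `settlesWithin`, and `closeChain` are hypothetical names, and `exited` stands in for whatever exit notification the real wrapper derives from Node's `ChildProcess` `exit` event.

```typescript
// Hypothetical minimal view of a spawned child.
interface ChildHandle {
  kill(signal: "SIGTERM" | "SIGKILL"): void;
  exited: Promise<void>; // resolves once the process has exited
}

type CloseOutcome = "closed" | "sigterm" | "sigkill";

// Resolves true if `p` settles within `ms`, false when the timer wins.
function settlesWithin(p: Promise<unknown>, ms: number): Promise<boolean> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve(false), ms);
    const done = () => {
      clearTimeout(timer);
      resolve(true);
    };
    p.then(done, done);
  });
}

async function closeChain(
  child: ChildHandle,
  gracefulClose: () => Promise<void>,
  closeTimeoutMs = 2000, // sidecar.processLifecycle.closeTimeoutMs
  killTimeoutMs = 1000, // sidecar.processLifecycle.killTimeoutMs
): Promise<CloseOutcome> {
  // Step 1: ask nicely; a failing cleanup just falls through to signals.
  gracefulClose().catch(() => undefined);
  if (await settlesWithin(child.exited, closeTimeoutMs)) return "closed";
  // Step 2: SIGTERM, then wait the shorter kill timeout.
  child.kill("SIGTERM");
  if (await settlesWithin(child.exited, killTimeoutMs)) return "sigterm";
  // Step 3: SIGKILL as the last resort; total wait is bounded by both timeouts.
  child.kill("SIGKILL");
  return "sigkill";
}
```

Because each child's chain is independent, a registry-level dispose could run `closeChain` for every live child with `Promise.all` and still finish within the same 3s worst case.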
Registry-level dispose triggers the close chain for every live PID in parallel, respecting the 3s budget.
Per-session PID manifest at .sidecar/pids.json (gitignored, one line per PID: { pid, command, args, cwd, spawnedAt, expectedPort?, sessionId }). Append on spawn, remove on clean exit, rotate on activation after the sweep completes.
Startup orphan sweep reads the manifest from the prior session (if any) and for each listed PID: (a) probe liveness via process.kill(pid, 0) (throws ESRCH when gone); (b) if alive, verify the process cmdline matches the stored spawn signature by reading /proc/<pid>/cmdline on Linux / ps -o command= -p <pid> on macOS — protects against killing an unrelated PID that got recycled to the same number; (c) if ours and still alive, run the SIGTERM → SIGKILL chain. Sweep runs in parallel; results surface in the activation log ([SideCar] Cleaned 2 orphan MCP processes from prior session) and as a SideCar: Show Orphan Sweep Report command for users who want the detail.
Port-lock sweep for HTTP/SSE MCPs: when a configured URL points at localhost:<port> and a pre-bind probe finds the port already in use, look up the owner via platform-specific tooling (lsof -i :<port> -t on macOS/Linux, netstat -ano | findstr :<port> on Windows), check whether the owner PID is in our prior manifest — if yes, kill it and retry bind; if no, surface a clear error asking the user to free the port before continuing.
MCPManager integration: disconnect() still calls the SDK’s client.close() first (gives the protocol a chance to exit gracefully) but in a Promise.race against a per-server timeout (default 2000ms, configurable via sidecar.mcpServers.<name>.closeTimeoutMs). On timeout, the underlying ManagedChildProcess takes over with SIGTERM. Window reloads that previously orphaned a stdio server now complete within 3s with zero survivors.
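The per-PID decision in the startup orphan sweep is the part most worth getting right, because killing a recycled PID would hit an unrelated process. The sketch below keeps that decision pure by taking the liveness and cmdline probes as inputs. The helper names and the exact-match comparison are assumptions: the spec only says the cmdline must match the stored spawn signature. The manifest fields follow the entry, and `/proc/<pid>/cmdline` does separate arguments with NUL bytes, which is why the input is normalised first.

```typescript
// Manifest line shape from the spec (one JSON object per line).
interface ManifestEntry {
  pid: number;
  command: string;
  args: string[];
  cwd: string;
  spawnedAt: string;
  expectedPort?: number;
  sessionId: string;
}

type SweepAction = "already-gone" | "pid-recycled" | "kill";

function parseManifest(text: string): ManifestEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as ManifestEntry);
}

// Assumed signature format: command and args joined by single spaces.
function spawnSignature(entry: ManifestEntry): string {
  return [entry.command, ...entry.args].join(" ");
}

// /proc/<pid>/cmdline is NUL-separated; ps output is already space-separated.
function normalizeCmdline(raw: string): string {
  return raw.replace(/\0/g, " ").trim();
}

// `alive` would come from process.kill(pid, 0); `rawCmdline` from /proc or ps
// (null when it could not be read, in which case we refuse to kill).
function classify(entry: ManifestEntry, alive: boolean, rawCmdline: string | null): SweepAction {
  if (!alive) return "already-gone";
  if (rawCmdline === null) return "pid-recycled";
  return normalizeCmdline(rawCmdline) === spawnSignature(entry) ? "kill" : "pid-recycled";
}
```

Only entries classified as `kill` would proceed to the SIGTERM → SIGKILL chain; the other two outcomes simply drop the entry from the rotated manifest.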
Composes with Shadow Workspaces: the git worktree dispose() path also runs through the registry so abandoned worktrees from crashed sessions get swept alongside process orphans.
Composes with Audit Mode: a new process_lifecycle audit event fires on every sweep (orphan killed, timeout triggered, kill chain completed) so admins auditing team environments can see the signal.
Configured via sidecar.processLifecycle.enabled (default true; setting false falls back to today’s best-effort dispose with a warning), sidecar.processLifecycle.closeTimeoutMs (default 2000, clamped 500–10000), sidecar.processLifecycle.killTimeoutMs (default 1000, clamped 200–5000), sidecar.processLifecycle.orphanSweep.enabled (default true), sidecar.processLifecycle.orphanSweep.reportOnActivation (default true — surfaces a toast when ≥1 orphan was cleaned; set false for headless / CI environments), and sidecar.processLifecycle.portSweep.enabled (default true).
Explicitly out of scope: process isolation sandboxing (cgroups, namespaces — OS-specific, belongs to later security work), resource quotas (CPU/RAM caps per child — vision-shelf item), cross-machine PID tracking for Dev Containers / SSH extension hosts (VS Code’s own lifecycle handles these — the extension host PID is the meaningful one, and VS Code kills it on disconnect).
-
Hook Execution Hardening — Streaming spawn + activity-adaptive timeouts + unified env sanitization — closes three real failure modes in the two hook systems SideCar ships today.
Current state: both sidecar.hooks (per-tool pre/post at executor.ts:816-862) and sidecar.eventHooks (onSave / onCreate / onDelete at eventHooks.ts:83-108) wrap execAsync — Node’s exec with a promise adapter. Both enforce a fixed 15s timeout. eventHooks.ts has a local sanitizeEnvValue() that strips control characters (null bytes, newlines, ESC sequences) from SIDECAR_FILE; sidecar.hooks applies redactSecrets() to SIDECAR_INPUT / SIDECAR_OUTPUT but does not strip control characters, so the two hook systems have inconsistent defenses against the same injection class.
Three gaps this entry closes: (1) exec buffer overflow — Node’s exec defaults to a 1 MB stdout cap; any hook producing more (a verbose test suite, a lint run with hundreds of findings, a Python script with a big traceback) crashes the hook with stdout maxBuffer length exceeded and the agent loop sees a generic failure rather than the actual hook output. (2) fixed timeout with no adaptivity — a slow-but-working npm test post-hook legitimately takes 45 seconds on a mid-size project; at 15s it gets killed even though stdout is streaming test progress the whole time. The agent loop interprets the timeout as a hook failure and either blocks (pre-hook) or warns (post-hook) when the hook was actually doing exactly what it should. (3) inconsistent env sanitization — SIDECAR_INPUT in sidecar.hooks can contain raw filename or tool-argument content with embedded ESC sequences or newlines that, under a bash -c "echo $SIDECAR_INPUT" pattern, bleed into the shell’s handling of the variable. redactSecrets() catches credential-shaped content but doesn’t normalize control chars.
Unified hookRunner.ts replaces execAsync in both sites. Uses child_process.spawn with piped stdio; reads stdout + stderr in chunks via data listeners; accumulates into a bounded ring buffer with explicit truncation semantics (default 10 MB cap via sidecar.hooks.maxOutputBytes, configurable; on overflow, drops the middle and keeps head + tail with a [... N bytes elided] marker, same pattern the existing prompt pruner uses for tool_result blocks); surfaces the truncated-but-complete output to the caller on exit.
Activity-adaptive timeout: initial budget from sidecar.hooks.timeoutMs (default 15000), a monotonic clock starts at spawn, each data event from stdout or stderr resets a per-activity timer to sidecar.hooks.extendOnActivityMs (default 5000). Hook is killed when either (a) initial budget elapses with zero output activity, or (b) total elapsed exceeds sidecar.hooks.maxTimeoutMs (default 300000, 5 min hard cap). A fast hook completes well under 15s; a slow-but-working hook that produces output every few seconds runs to completion up to the 5 min hard cap; a truly hung hook that goes silent gets killed at the initial 15s boundary. Configurable, but defaults are tuned for the common cases: lint/format run quickly, test suites take minutes with streaming output.
Unified sanitization: extract sanitizeEnvValue() from eventHooks.ts into src/agent/envSanitize.ts (new module, exports a single pure function) and apply it to every hook env var in both hook systems — SIDECAR_TOOL, SIDECAR_INPUT, SIDECAR_OUTPUT, SIDECAR_FILE, SIDECAR_EVENT. redactSecrets() still runs on top of sanitization for credential content. Same defense surface applied uniformly; fixes the inconsistency where eventHooks was hardened but tool-hooks weren’t.
Hook children route through ManagedChildProcess (the Process Lifecycle Hardening primitive in the paired spec above) — so a hook that slips past every timeout and VS Code force-kills the extension host still gets cleaned up on next activation via the orphan sweep. Same registry, same PID manifest, same disposal guarantees.
Composes with Audit Mode: the event_hook:<event> audit entry already exists; this entry extends it with the new truncated, killedBy: 'idle-timeout' | 'hard-cap' | 'caller-abort', and bytesReceived fields so /audit queries surface “hook was killed for going silent too long” vs. “hook produced 10 MB of output and was truncated” distinctly.
Composes with Regression Guards: guard commands also use the hook-runner substrate so guards with streaming output (a long-running fuzz test, a numerical-invariant sweep) benefit from activity-adaptive timeouts too, without separate plumbing.
Configured via sidecar.hooks.maxOutputBytes (default 10_000_000, clamped 1_000_000–104_857_600), sidecar.hooks.timeoutMs (default 15000, clamped 1000–60000 — the initial silent-budget), sidecar.hooks.extendOnActivityMs (default 5000, clamped 1000–60000 — the per-chunk extension), and sidecar.hooks.maxTimeoutMs (default 300000, clamped 15000–1800000 — the absolute hard cap). Ships in v0.70 as part of the runtime-correctness pass, paired with Process Lifecycle Hardening.
- Bitbucket / Atlassian — Bitbucket REST API,
GitProvider interface, auto-detect from remote URL
- OpenRouter — dedicated integration with model browsing, cost display, rate limit awareness → shipped 2026-04-15 in v0.53.0. Dedicated OpenRouterBackend subclass with referrer + title headers, rich catalog fetch via listOpenRouterModels(), first-class entry in BUILT_IN_BACKEND_PROFILES, and a runtime MODEL_COSTS overlay populated from OpenRouter’s per-model pricing (no more hand-maintaining prices for hundreds of proxied models). Per-generation real cost tracking via /generation/{id} still deferred.
- Browser automation — Playwright MCP for testing web apps
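Returning to the Hook Execution Hardening entry above: its two core mechanisms, the activity-adaptive timeout and head + tail truncation, can each be expressed as a pure function. The sketch below is one reading of the spec, not the hookRunner.ts implementation. In particular, it assumes the idle deadline is the later of "spawn + timeoutMs" and "last output + extendOnActivityMs", capped by maxTimeoutMs, and it counts string characters where the real runner would count bytes. The function names and the `HookTimeouts` shape are hypothetical.

```typescript
// Mirrors sidecar.hooks.timeoutMs / extendOnActivityMs / maxTimeoutMs.
interface HookTimeouts {
  timeoutMs: number; // default 15000
  extendOnActivityMs: number; // default 5000
  maxTimeoutMs: number; // default 300000
}

type KilledBy = "idle-timeout" | "hard-cap";

// Returns when the hook should be killed and why, given the last output time
// (null when the hook has produced no output yet). Times are ms on one clock.
function killDeadline(
  spawnedAt: number,
  lastActivityAt: number | null,
  cfg: HookTimeouts,
): { at: number; reason: KilledBy } {
  const initial = spawnedAt + cfg.timeoutMs;
  const idle =
    lastActivityAt === null
      ? initial
      : Math.max(initial, lastActivityAt + cfg.extendOnActivityMs);
  const hardCap = spawnedAt + cfg.maxTimeoutMs;
  return idle < hardCap ? { at: idle, reason: "idle-timeout" } : { at: hardCap, reason: "hard-cap" };
}

// Keeps the head and tail of oversized output and marks the dropped middle.
// The marker itself is not counted against maxLen.
function elideMiddle(output: string, maxLen: number): string {
  if (output.length <= maxLen) return output;
  const half = Math.floor(maxLen / 2);
  const elided = output.length - 2 * half;
  const head = output.slice(0, half);
  const tail = output.slice(output.length - half);
  return `${head}[... ${elided} bytes elided]${tail}`;
}
```

Under this reading, a hook that last printed at 14s is allowed to run until 19s, one that prints every few seconds keeps extending until the 5 min hard cap, and one that never prints is killed at 15s.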
- Extension / plugin API (vision-shelf — superseded by @sidecar/sdk above) — the original bullet described the intent; the spec above is the concrete v0.73 implementation.
-
Agentic Task Delegation via MCP — elevates MCP from a static tool registry into a dynamic sub-agent orchestration layer. Instead of treating every MCP server as a dumb function call, SideCar can spawn specialised servers on-demand (e.g. a math-engine for symbolic computation, a web-searcher for live retrieval, a code-executor sandbox) and route sub-tasks to them as first-class agents with their own reasoning loop. The lead agent decomposes the user’s request, dispatches sub-tasks to the most capable server via a new delegate_to_mcp tool call, collects structured results, and synthesises a final response — mirroring the hierarchical multi-agent pattern but using the MCP protocol as the inter-agent transport. Server lifecycle (spawn, health-check, teardown) is managed automatically, and each delegation is recorded in the audit log with the server name, input, output, and latency. Configurable via sidecar.mcpDelegation.enabled and sidecar.mcpDelegation.allowedServers.
- Voice input (shipped v0.98.0) — microphone button in chat UI. Audio recorded in the VS Code extension host (Swift/AVFoundation on macOS, arecord on Linux, PowerShell+WinMM on Windows — no browser window). Transcribed locally via @huggingface/transformers Whisper or via any HTTP Whisper endpoint. Gated by sidecar.voice.enabled.
Enterprise & Collaboration
- Centralized policy management — .sidecar-policy.json for org-level enforcement of approval modes, blocked tools, PII redaction, provider restrictions
- Multi-User Agent Shadows — a shared agent knowledge base that lets every contributor’s SideCar instance start with the same learned project context. A team member runs SideCar: Export Project Shadow to serialise the agent’s accumulated knowledge — coding standards, design tokens (colors, typography), mathematical definitions, architectural decisions, naming conventions — into a versioned .sidecar/shadow.json file that is committed to the repo. When a new contributor opens the project, SideCar detects the shadow file and automatically imports it into their local memory store, so their instance already knows the project’s conventions without a single prompt. Entries are namespaced by category (standards, design, math, architecture) and can be individually pinned or overridden locally. Shadow exports are human-readable JSON so they can be reviewed and edited in PRs like any other config file. Controlled via sidecar.shadow.autoImport (default: true) and sidecar.shadow.autoExport (default: false — export is always an explicit user action to avoid leaking sensitive context).
- Team knowledge base — built-in connectors for Confluence, Notion, internal docs
- Real-time collaboration Phase 1 — VS Code Live Share integration (shared chat, presence, host/guest roles)
- Real-time collaboration Phase 2 — shared agent control (multi-user approval, message attribution)
- Real-time collaboration Phase 3 — concurrent editing with CRDT/OT conflict resolution
- Real-time collaboration Phase 4 — standalone @sidecar/collab-server WebSocket package
Technical Debt
- Config sub-object grouping (30+ fields → sub-objects)
- Real tokenizer integration (js-tiktoken for accurate counting)