feat: history compression for long chat sessions #4
Long sessions with local models hit the context wall fast: by message 15 or so on an 8k model, every new tool result risks pushing the agent into the context-overflow zone, where it either truncates silently or stops answering coherently. This PR adds a sliding-window compressor that watches the projected input cost of each turn and, when that cost would cross a mode-aware threshold (50% Fast / 60% Normal / 70% Deep Research of the model's context window), collapses the older turns into a single ~150-word summary while keeping the last 4 messages verbatim. Wikilink citations like `[[Note Title]]` are preserved through the summary so the assistant can still reason about previously cited notes.

The compressor never blocks a turn: if the summariser call fails (Ollama down, network blip, anything), it logs the error and falls back to the uncompressed history. The summary is cached on the frontend session and re-sent on the next turn, so subsequent turns extend the existing summary (one short LLM call) rather than re-summarising from scratch (one long LLM call). When the cache hash no longer matches the to-summarise prefix, we fall back to a fresh summary, so no stale summary ever leaks into a turn.
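For reviewers who want the shape of the sliding window and the never-block fallback without opening the diff, here is a minimal sketch. It is not the code in `src/agent/history.rs`: the `Message` struct, the `Summarizer` signature, the `system` role used for the summary message, and the two stub summarisers are all assumptions made for illustration. Only the "keep the last 4, collapse the rest, fall back on failure" behaviour comes from the description above.

```rust
/// Hypothetical message type; the real one in the diff may differ.
#[derive(Clone, Debug, PartialEq)]
struct Message {
    role: String,
    content: String,
}

/// Number of most recent messages kept verbatim (from the PR description).
const KEEP_RECENT: usize = 4;

/// Assumed shape of the summariser abstraction: any `Err` means "could not summarise".
trait Summarizer {
    fn summarize(&self, older: &[Message]) -> Result<String, String>;
}

/// Stub that always returns the same summary (illustration only).
struct FixedSummarizer(&'static str);

impl Summarizer for FixedSummarizer {
    fn summarize(&self, _older: &[Message]) -> Result<String, String> {
        Ok(self.0.to_string())
    }
}

/// Stub that simulates the summariser being unreachable (illustration only).
struct FailingSummarizer;

impl Summarizer for FailingSummarizer {
    fn summarize(&self, _older: &[Message]) -> Result<String, String> {
        Err("summariser unreachable".to_string())
    }
}

/// Collapse everything except the last `KEEP_RECENT` messages into one summary
/// message. If there is nothing old enough to collapse, or the summariser
/// fails, return the history unchanged so the turn is never blocked.
fn compress(history: &[Message], summarizer: &dyn Summarizer) -> Vec<Message> {
    if history.len() <= KEEP_RECENT {
        return history.to_vec();
    }
    let (older, recent) = history.split_at(history.len() - KEEP_RECENT);
    match summarizer.summarize(older) {
        Ok(summary) => {
            let mut out = Vec::with_capacity(KEEP_RECENT + 1);
            // Assumption: the summary is injected as a single system message.
            out.push(Message { role: "system".to_string(), content: summary });
            out.extend_from_slice(recent);
            out
        }
        Err(err) => {
            // Log and fall back to the uncompressed history.
            eprintln!("history compression failed, sending full history: {}", err);
            history.to_vec()
        }
    }
}
```

With a 10-message history and a working summariser this yields 5 messages (one summary plus the last 4); with `FailingSummarizer` it yields the original 10 untouched.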
You'll see a small `compressed` chip in the telemetry panel's Token breakdown card whenever the most recent assistant turn ran through the compressor; hover for the before → after token counts and how many prior turns got collapsed.

How the threshold check works
Each provider already reports the real `prompt_tokens` and `completion_tokens` on the SSE `Done` event at the end of every turn. The frontend caches those counts on the session and re-sends them as `prev_turn_tokens` on the next `/chat`. The compressor projects this turn's input cost as last turn's prompt tokens, plus last turn's completion tokens, plus an estimate for the new user message, because everything in last turn's history is still in this turn's history, plus the last assistant reply (= `prev_completion_tokens`), plus whatever the user just typed. The system prompt's volatile suffix (date, top tags) drifts by under 10 tokens between turns and is well inside the 512-token budget buffer.
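A rough sketch of that check, assuming the 512-token buffer is added to the projected cost before comparing against the per-mode percentage of the context window, and that "cross" means strictly greater than. The function names, the `Mode` enum, and the field layout of `PrevTurnTokens` are hypothetical; the percentages, the buffer size, and the `len * 2 / 7` heuristic come from this description. The `None` branch covers turns with no prior counts, where the whole history is estimated from its character length.

```rust
/// Chat modes named in the PR description.
#[derive(Clone, Copy)]
enum Mode {
    Fast,
    Normal,
    DeepResearch,
}

impl Mode {
    /// Compression threshold as a percentage of the model's context window.
    fn compression_threshold_pct(self) -> u64 {
        match self {
            Mode::Fast => 50,
            Mode::Normal => 60,
            Mode::DeepResearch => 70,
        }
    }
}

/// Token counts reported on the previous turn's `Done` event (assumed layout).
#[derive(Clone, Copy)]
struct PrevTurnTokens {
    prompt_tokens: u64,
    completion_tokens: u64,
}

/// Budget buffer from the PR description.
const BUDGET_BUFFER_TOKENS: u64 = 512;

/// Character heuristic: `len * 2 / 7`, about 3.5 characters per token.
fn estimate_tokens(chars: usize) -> u64 {
    (chars as u64 * 2) / 7
}

/// Projected input cost of the upcoming turn.
/// With prior counts: last prompt + last completion + estimated new user message.
/// Without them: estimate the whole history from its character length.
fn projected_input_tokens(
    prev: Option<PrevTurnTokens>,
    history_chars: usize,
    new_user_chars: usize,
) -> u64 {
    let new_user = estimate_tokens(new_user_chars);
    match prev {
        Some(p) => p.prompt_tokens + p.completion_tokens + new_user,
        None => estimate_tokens(history_chars) + new_user,
    }
}

/// Assumption: compress when projected cost plus the buffer exceeds the
/// mode's share of the context window.
fn should_compress(projected: u64, context_window: u64, mode: Mode) -> bool {
    let threshold = context_window * mode.compression_threshold_pct() / 100;
    projected + BUDGET_BUFFER_TOKENS > threshold
}
```

Under these assumptions, an 8192-token window in Fast mode has a threshold of 4096, so a projected cost of 3700 trips compression (3700 + 512 = 4212) while 3500 does not (4012). The same 3700 stays under the Normal-mode threshold of 4915.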
On the first turn of a session (no prior data) and for legacy clients that don't ship `prev_turn_tokens`, we fall back to a conservative character heuristic (`len * 2 / 7` ≈ `len / 3.5`, i.e. about 3.5 chars per token). That's intentionally pessimistic so we err toward triggering compression a turn early rather than overflowing.

Why this matters
See `docs/superpowers/specs/2026-05-18-history-compression-design.md` and `docs/superpowers/plans/2026-05-18-history-compression.md` for future reference.

What's in the diff
Backend (`src-tauri/`):

- `src/agent/history.rs` (~450 lines): `Compressor`, `Summarizer` trait + `ProviderSummarizer`, `HistorySummary`, `CompressionInfo`, `PrevTurnTokens`, blake3 canonical-prefix hash, strict-prefix gap detection.
- `agent/prompts.rs`: two prompts, one for fresh summarisation, one for extending a cached summary.
- `compression_threshold_pct` added per-mode to `ModeBudgets`.
- `ChatEvent::Done` now carries `compression: Option<CompressionInfo>`; SSE handler emits it conditionally (no `"compression": null` noise when absent).
- `create_chat_stream` runs the compressor right after `build_system_prompt`; outcome plumbed onto `Done` via an `Arc<Mutex<Option<_>>>` slot.
- `ChatRequest` accepts optional `history_summary` and `prev_turn_tokens` (both `#[serde(default)]` for back-compat).

Frontend (`src/`):

- `chat-client.ts`: new `HistorySummary`, `CompressionInfo`, `PrevTurnTokens` types, request-body emission, `message_complete` forwarding.
- `useChat.tsx` + `mock-data.ts`: session cache for `historySummary` and `lastCompression`; `tokenUsage` (already cached) is now also passed through as `prevTurnTokens` on every `/chat`.
- `telemetry.tsx`: `compressed` chip with native tooltip in the Token breakdown card header.

Test plan
- `cargo test --lib agent::history`: 14 tests: monotonic & conservative estimate, hash stability, hash differentiation, first-turn no-op, short-history no-op, over-threshold collapse with discriminating SUMMARIZATION-prompt assertion, cache reuse, cache extension with discriminating EXTEND-prompt assertion, cache discard, failure fallback, precise-token threshold trip, precise-token under-threshold no-op
- `cargo test --lib chat_request_`: 4 serde back-compat tests (with/without `history_summary`, with/without `prev_turn_tokens`)
- `cargo test --lib`: 83 total backend tests passing
- `cargo clippy --lib -- -D warnings` clean
- `bun run lint` clean
- `bun run build` clean (static export, no type errors)

Known limitations (deferred to v1b, documented in spec)
- The cached summary and `tokenUsage` live in frontend session state; app restart loses them (first turn after reload falls back to the heuristic until the next `Done` arrives).
- `Compressor.settings` field is currently `#[allow(dead_code)]`, reserved for v1b per-vault overrides.