feat: history compression for long chat sessions

hegdeatri commented

2026-05-18 21:22:41 +01:00

Owner

Long sessions with local models hit the context wall fast — by message 15 or so on an 8k model, every new tool result risks pushing the agent into the context-overflow zone where it either truncates silently or just stops answering coherently. This PR adds a sliding-window compressor that watches the projected input cost of each turn and, when it would cross a mode-aware threshold (50% Fast / 60% Normal / 70% Deep Research of the model's context window), collapses the older turns into a single ~150-word summary while keeping the last 4 messages verbatim. Wikilink citations like [[Note Title]] are preserved through the summary so the assistant can still reason about previously-cited notes.
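As a rough illustration of the mode-aware threshold and the 4-message keep window described above, here is a minimal sketch. The `Mode` enum, `threshold_tokens`, and `split_keep_window` are hypothetical names chosen for this example, not the PR's actual API; only the percentages and the keep-window size come from the description.

```rust
/// Chat modes named in the description. The enum itself is an assumption.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mode {
    Fast,
    Normal,
    DeepResearch,
}

/// Percentage of the context window at which compression trips (from the PR text).
fn threshold_pct(mode: Mode) -> u64 {
    match mode {
        Mode::Fast => 50,
        Mode::Normal => 60,
        Mode::DeepResearch => 70,
    }
}

/// Token count at which compression trips for a given context window.
/// Integer division rounds down; the real rounding behaviour is not stated.
fn threshold_tokens(mode: Mode, context_window: u64) -> u64 {
    context_window * threshold_pct(mode) / 100
}

/// Number of trailing messages kept verbatim (from the PR text).
const KEEP_LAST: usize = 4;

/// Split history into (older turns to summarise, tail kept verbatim).
fn split_keep_window<T>(history: &[T]) -> (&[T], &[T]) {
    let cut = history.len().saturating_sub(KEEP_LAST);
    history.split_at(cut)
}
```

Under these assumptions an 8192-token model trips at 4096 tokens in Fast mode, and a six-message history splits into two messages to summarise and four kept as-is.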

The compressor never blocks a turn: if the summariser call fails (Ollama down, network blip, anything), it logs the error and falls back to the uncompressed history. The summary is cached on the frontend session and re-sent on the next turn, so subsequent turns extend the existing summary (one short LLM call) rather than re-summarising from scratch (one long LLM call). When the cache hash doesn't line up with the to-summarise prefix anymore, we fall back to fresh — no stale summary ever leaks into a turn.
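The never-block behaviour can be sketched roughly as follows. This is a simplified, synchronous stand-in: the real `Summarizer` trait is presumably async and richer, and `compress_or_fallback`, `AlwaysFails`, and `Canned` are hypothetical names used only for illustration.

```rust
/// Simplified, synchronous stand-in for the PR's `Summarizer` trait (assumed shape).
trait Summarizer {
    fn summarize(&self, older: &[String]) -> Result<String, String>;
}

/// Returns the history to send this turn. If the summariser fails, the error
/// is logged and the original history is returned untouched.
fn compress_or_fallback(
    summarizer: &dyn Summarizer,
    history: &[String],
    keep_last: usize,
) -> Vec<String> {
    if history.len() <= keep_last {
        return history.to_vec(); // nothing old enough to collapse
    }
    let (older, tail) = history.split_at(history.len() - keep_last);
    match summarizer.summarize(older) {
        Ok(summary) => {
            let mut out = Vec::with_capacity(tail.len() + 1);
            out.push(format!("[summary] {summary}"));
            out.extend_from_slice(tail);
            out
        }
        Err(err) => {
            eprintln!("history compression failed, sending uncompressed history: {err}");
            history.to_vec()
        }
    }
}

/// Test double that simulates an unreachable provider.
struct AlwaysFails;
impl Summarizer for AlwaysFails {
    fn summarize(&self, _older: &[String]) -> Result<String, String> {
        Err("provider unreachable".to_string())
    }
}

/// Test double that returns a fixed summary.
struct Canned(&'static str);
impl Summarizer for Canned {
    fn summarize(&self, _older: &[String]) -> Result<String, String> {
        Ok(self.0.to_string())
    }
}
```

With `AlwaysFails` the caller gets the input history back unchanged, so the turn proceeds exactly as it would without the compressor.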

You'll see a small compressed chip in the telemetry panel's Token breakdown card whenever the most recent assistant turn ran through the compressor; hover for the before → after token counts and how many prior turns got collapsed.

How the threshold check works

Each provider already reports the real prompt_tokens and completion_tokens on the SSE Done event at the end of every turn. The frontend caches those counts on the session and re-sends them as prev_turn_tokens on the next /chat. The compressor uses:

projected_input ≈ prev_prompt_tokens + prev_completion_tokens + estimate(new_user_message)

This holds because last turn's prompt (counted by prev_prompt_tokens) is still in this turn's history, the last assistant reply (prev_completion_tokens) has been appended to it, and the only other new input is whatever the user just typed. The system prompt's volatile suffix (date, top tags) drifts by under 10 tokens between turns, which is well inside the 512-token budget buffer.
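A minimal sketch of that projection and the threshold comparison, under stated assumptions: the field layout of `PrevTurnTokens` is guessed from the names in the description, `projected_input` and `crosses_threshold` are hypothetical helpers, and the 512-token buffer is deliberately left out because the description does not say where it is applied.

```rust
/// Name taken from the PR; the field layout is an assumption.
#[derive(Clone, Copy, Debug)]
struct PrevTurnTokens {
    prompt_tokens: u32,
    completion_tokens: u32,
}

/// projected_input ≈ prev_prompt_tokens + prev_completion_tokens + estimate(new_user_message)
/// The estimate for the new message is passed in already computed.
fn projected_input(prev: PrevTurnTokens, new_user_message_tokens: u32) -> u32 {
    prev.prompt_tokens
        .saturating_add(prev.completion_tokens)
        .saturating_add(new_user_message_tokens)
}

/// True when the projected input exceeds the mode's share of the context window.
/// Whether the real check uses `>` or `>=` is not stated; `>` is assumed here.
fn crosses_threshold(projected: u32, context_window: u32, threshold_pct: u32) -> bool {
    let limit = (u64::from(context_window) * u64::from(threshold_pct) / 100) as u32;
    projected > limit
}
```

For example, a previous turn of 3900 prompt tokens and 400 completion tokens plus a 20-token message projects to 4320 tokens, which crosses the 50% line of an 8192-token window (4096) but not the 60% line (4915).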

On the first turn of a session (no prior data) and for legacy clients that don't ship prev_turn_tokens, we fall back to a conservative character heuristic: len * 2 / 7 tokens, i.e. roughly one token per 3.5 characters. That's intentionally pessimistic so we err toward triggering compression a turn early rather than overflowing.
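The fallback heuristic is small enough to sketch directly. Two details here are assumptions rather than facts from the PR: that `len` means UTF-8 byte length, and that the division rounds up (suggested, but not confirmed, by the `div_ceil` commit in this PR). `estimate_history_tokens` is a hypothetical helper.

```rust
/// Character heuristic: len * 2 / 7, i.e. about one token per 3.5 characters.
/// Assumptions: `len` is the UTF-8 byte length, and the result rounds up.
fn estimate_tokens(text: &str) -> usize {
    (text.len() * 2).div_ceil(7)
}

/// Fallback total when no previous-turn token counts are available:
/// estimate every message and add the results.
fn estimate_history_tokens(messages: &[&str]) -> usize {
    messages.iter().map(|m| estimate_tokens(m)).sum()
}
```

Rounding up means a non-empty message never estimates to zero tokens, which keeps the fallback on the cautious side for very short inputs.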

Why this matters

- Unblocks long chat threads on local 8–16k models without the user having to manually clear context or start a new thread.
- Threshold accuracy is provider-grade, not heuristic-grade: the compressor uses the same token counts that OpenRouter, Ollama, and OpenAI-compatible servers report back, so tool-result heavy turns no longer get under-counted.
- No regression for short sessions: the no-op fast paths (empty history, history ≤ 4 messages, total under threshold) bail before any LLM call.
- Cheaper per-turn cost on extended sessions: cached summaries get extended with a short LLM call instead of being regenerated.
- Plays nicely with the existing modes work (PR !1): Fast triggers earliest because users want snappy responses; Deep Research triggers latest because it tolerates verbose context.
- Spec + plan are committed alongside at `docs/superpowers/specs/2026-05-18-history-compression-design.md` and `docs/superpowers/plans/2026-05-18-history-compression.md` for future reference.
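The no-op fast paths listed above could be ordered roughly like this. It is a sketch only: the `Decision` enum and `decide` function are hypothetical, and treating a projection exactly equal to the threshold as "under" is an assumption.

```rust
/// Outcome of the pre-flight checks. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum Decision {
    SkipEmpty,
    SkipShort,
    SkipUnderThreshold,
    Compress,
}

/// Runs the cheap checks first so that short sessions never reach an LLM call.
fn decide(history_len: usize, projected_tokens: u32, threshold_tokens: u32) -> Decision {
    const KEEP_LAST: usize = 4; // keep-window size from the PR description
    if history_len == 0 {
        return Decision::SkipEmpty;
    }
    if history_len <= KEEP_LAST {
        return Decision::SkipShort; // everything already fits in the keep window
    }
    if projected_tokens <= threshold_tokens {
        return Decision::SkipUnderThreshold;
    }
    Decision::Compress
}
```

Only the final branch would go on to call the summariser; the three skip variants return before any model traffic.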

What's in the diff

Backend (src-tauri/):

- New `src/agent/history.rs` (~450 lines): `Compressor`, `Summarizer` trait + `ProviderSummarizer`, `HistorySummary`, `CompressionInfo`, `PrevTurnTokens`, blake3 canonical-prefix hash, strict-prefix gap detection.
- Two new system prompts in `agent/prompts.rs`: one for fresh summarisation, one for extending a cached summary.
- `compression_threshold_pct` added per-mode to `ModeBudgets`.
- `ChatEvent::Done` now carries `compression: Option<CompressionInfo>`; the SSE handler emits it conditionally (no `"compression": null` noise when absent).
- `create_chat_stream` runs the compressor right after `build_system_prompt`; the outcome is plumbed onto `Done` via an `Arc<Mutex<Option<_>>>` slot.
- `ChatRequest` accepts optional `history_summary` and `prev_turn_tokens` (both `#[serde(default)]` for back-compat).
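To make the reuse / extend / discard decision for the cached summary concrete, here is a hedged sketch. It uses the standard library's `DefaultHasher` purely as a self-contained stand-in for the blake3 hash named above, and `CachedSummary`, `CachePlan`, `prefix_hash`, and `plan` are all hypothetical names; the real `HistorySummary` layout is not shown in this PR description.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// What to do with a cached summary. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum CachePlan {
    Reuse,                  // cache covers exactly the prefix to summarise
    Extend { from: usize }, // cache covers a strict prefix; summarise the gap
    Fresh,                  // no cache, or it no longer matches
}

/// Assumed shape of the cached summary's bookkeeping.
struct CachedSummary {
    covered: usize,   // how many leading messages the summary covers
    prefix_hash: u64, // hash of exactly those messages
}

/// Stand-in for the PR's blake3 canonical-prefix hash (std hasher, not blake3).
fn prefix_hash(messages: &[String]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for message in messages {
        message.hash(&mut hasher);
    }
    hasher.finish()
}

/// Decide how to treat the cache for the current to-summarise prefix.
fn plan(cache: Option<&CachedSummary>, to_summarise: &[String]) -> CachePlan {
    let Some(cached) = cache else {
        return CachePlan::Fresh;
    };
    if cached.covered == 0 || cached.covered > to_summarise.len() {
        return CachePlan::Fresh;
    }
    if prefix_hash(&to_summarise[..cached.covered]) != cached.prefix_hash {
        return CachePlan::Fresh; // history diverged; never reuse a stale summary
    }
    if cached.covered == to_summarise.len() {
        CachePlan::Reuse
    } else {
        CachePlan::Extend { from: cached.covered }
    }
}
```

The key property is that any mismatch falls through to `Fresh`, so a summary is only reused or extended when the messages it covers are byte-for-byte the ones at the front of the current history.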

Frontend (src/):

- `chat-client.ts`: new `HistorySummary`, `CompressionInfo`, `PrevTurnTokens` types, request-body emission, `message_complete` forwarding.
- `useChat.tsx` + `mock-data.ts`: session cache for `historySummary` and `lastCompression`; `tokenUsage` (already cached) is now also passed through as `prevTurnTokens` on every `/chat`.
- `telemetry.tsx`: `compressed` chip with native tooltip in the Token breakdown card header.

Test plan

- [x] `cargo test --lib agent::history` (14 tests): monotonic & conservative estimate, hash stability, hash differentiation, first-turn no-op, short-history no-op, over-threshold collapse with discriminating SUMMARIZATION-prompt assertion, cache reuse, cache extension with discriminating EXTEND-prompt assertion, cache discard, failure fallback, precise-token threshold trip, precise-token under-threshold no-op
- [x] `cargo test --lib chat_request_`: 4 serde back-compat tests (with/without `history_summary`, with/without `prev_turn_tokens`)
- [x] `cargo test --lib`: 83 total backend tests passing
- [x] `cargo clippy --lib -- -D warnings` clean
- [x] `bun run lint` clean
- [x] `bun run build` clean (static export, no type errors)
- [ ] Manual: send 10-ish turns at a local 8k model, watch the compressor trip, hover the chip, confirm a follow-up turn extends the cached summary rather than regenerating (left for you to drive against your live vault)

Known limitations (deferred to v1b, documented in spec)

- Tool-result heavy turns inside the keep window stay verbatim; tool-result stubbing for the keep window is v1b.
- Summariser uses the same chat model, which causes GPU contention on single-GPU local setups. Future work could pin it to a smaller dedicated model.
- Summary cache + `tokenUsage` live in frontend session state; app restart loses them (first turn after reload falls back to the heuristic until the next `Done` arrives).
- `Compressor.settings` field is currently `#[allow(dead_code)]`, reserved for v1b per-vault overrides.


hegdeatri added 18 commits

2026-05-18 21:22:41 +01:00

feat(config): add compression_threshold_pct to ModeBudgets 6454bfa736

feat(prompts): SUMMARIZATION + EXTEND_SUMMARIZATION system prompts 4c96990cc6

feat(history): scaffold Compressor types + estimate_tokens 1ef3cb4bbb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(history): no-op fast paths for first turn and short history df898a79f0

feat(history): threshold-tripped path with fresh summarisation e38f9a7998

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(history): cache reuse on exact hash match 4571e9163c

feat(history): cache extension on strict-prefix match d5498b043c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

test(history): discriminate fresh vs. extension by system_prompt f550a98c63

Capture last_system_prompt in FakeSummarizer and assert on it in
both the fresh-summarisation and cache-extension tests, so the two
paths produce different observable outcomes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(history): cache-discard and summariser-failure regression tests 5f48a9d781

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(chat-event): add compression field to Done variant a44e70498b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(chat): plumb history compressor into create_chat_stream fd9338087c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(history): allow-dead-code on Compressor.settings reserved for v1b 894aa172d2

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(routes): accept optional history_summary on ChatRequest 8e79802c5e

fix(history): use div_ceil to satisfy clippy b42a00cade

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(chat-client): HistorySummary + CompressionInfo types on /chat 23c634d51c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(useChat): cache historySummary per session and re-send on next turn 673a80aadc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(telemetry): compressed chip with tooltip in token breakdown a835b7c7b3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(todo): tier 2 history compression shipped f6c8c88211

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hegdeatri added 1 commit

2026-05-18 21:37:01 +01:00

feat(history): use real prev-turn tokens for compressor threshold 9464edadbc

The character heuristic (len/3.5) systematically under-counts on
tool-result heavy turns and ignores model-specific tokenisation.
The previous turn's prompt_tokens + completion_tokens is already
reported by every provider on Done — re-send it from the frontend
session cache and the compressor can project the next turn's
input as: prev_prompt + prev_completion + tokens(new_user_message).

Falls back to the character heuristic on the first turn of a
session (no prior data) and on legacy clients that don't ship
prev_turn_tokens. 4 new tests cover the precise path + serde
back-compat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hegdeatri merged commit 0f93abc281 into master

2026-05-18 21:50:37 +01:00

hegdeatri deleted branch feature/history-compression

2026-05-18 21:50:38 +01:00

hegdeatri referenced this pull request from a commit

2026-05-18 21:50:39 +01:00

Merge pull request 'feat: history compression for long chat sessions' (#4) from feature/history-compression into master
