feat: history compression for long chat sessions #4

Merged
hegdeatri merged 19 commits from feature/history-compression into master 2026-05-18 21:50:37 +01:00

Long sessions with local models hit the context wall fast — by message 15 or so on an 8k model, every new tool result risks pushing the agent into the context-overflow zone where it either truncates silently or just stops answering coherently. This PR adds a sliding-window compressor that watches the projected input cost of each turn and, when it would cross a mode-aware threshold (50% Fast / 60% Normal / 70% Deep Research of the model's context window), collapses the older turns into a single ~150-word summary while keeping the last 4 messages verbatim. Wikilink citations like [[Note Title]] are preserved through the summary so the assistant can still reason about previously-cited notes.
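As a rough illustration of the trigger and the keep window described above, here is a minimal sketch. The percentages and the keep-last-4 rule come from this description; the names `Mode`, `threshold_pct`, `should_compress` and `split_history` are hypothetical, and treating "cross" as a strict greater-than is an assumption.

```rust
/// Chat modes named in the PR description.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Mode {
    Fast,
    Normal,
    DeepResearch,
}

/// Number of trailing messages kept verbatim, per the description.
pub const KEEP_LAST: usize = 4;

/// Percentage of the context window at which compression triggers.
pub fn threshold_pct(mode: Mode) -> u32 {
    match mode {
        Mode::Fast => 50,
        Mode::Normal => 60,
        Mode::DeepResearch => 70,
    }
}

/// True when the projected input exceeds the mode's share of the window.
/// (Strict `>` is an assumption; the real check may use `>=`.)
pub fn should_compress(projected_input: u32, context_window: u32, mode: Mode) -> bool {
    // Widen to u64 so the multiplication cannot overflow.
    let limit = u64::from(context_window) * u64::from(threshold_pct(mode)) / 100;
    u64::from(projected_input) > limit
}

/// Splits history into (older turns to summarise, messages kept verbatim).
pub fn split_history<T>(history: &[T]) -> (&[T], &[T]) {
    // With 4 or fewer messages the "older" slice is empty, i.e. a no-op.
    let cut = history.len().saturating_sub(KEEP_LAST);
    history.split_at(cut)
}
```

With an 8,192-token window, a projected input of 4,200 tokens would trip Fast (limit 4,096) but not Normal (limit 4,915).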

The compressor never blocks a turn: if the summariser call fails (Ollama down, network blip, anything), it logs the error and falls back to the uncompressed history. The summary is cached on the frontend session and re-sent on the next turn, so subsequent turns extend the existing summary (one short LLM call) rather than re-summarising from scratch (one long LLM call). When the cache hash doesn't line up with the to-summarise prefix anymore, we fall back to fresh — no stale summary ever leaks into a turn.
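The cache decision above (reuse, extend, or discard) can be sketched roughly as follows. This is not the code in `history.rs`: the PR uses a blake3 canonical-prefix hash, while this sketch substitutes the standard library's `DefaultHasher` to stay dependency-free, and `CachedSummary`, `Plan` and `plan` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A cached summary plus the hash of the message prefix it was built from.
#[derive(Clone, Debug, PartialEq)]
pub struct CachedSummary {
    pub text: String,
    pub prefix_hash: u64,
    /// Number of leading messages the summary covers.
    pub covered: usize,
}

/// Stand-in for the PR's blake3 prefix hash (std hasher, assumption).
pub fn prefix_hash(messages: &[String]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for message in messages {
        message.hash(&mut hasher);
    }
    hasher.finish()
}

#[derive(Debug, PartialEq)]
pub enum Plan {
    /// Cache covers exactly the prefix: reuse it unchanged.
    Reuse,
    /// Cache covers a strict prefix: extend it with messages from `from` on.
    Extend { from: usize },
    /// No usable cache: summarise from scratch.
    Fresh,
}

/// Decides how to obtain a summary for `to_summarise`.
pub fn plan(cache: Option<&CachedSummary>, to_summarise: &[String]) -> Plan {
    match cache {
        // The cache is valid only if the messages it covered are unchanged.
        Some(c)
            if c.covered <= to_summarise.len()
                && c.prefix_hash == prefix_hash(&to_summarise[..c.covered]) =>
        {
            if c.covered == to_summarise.len() {
                Plan::Reuse
            } else {
                Plan::Extend { from: c.covered }
            }
        }
        // Missing cache or hash mismatch: never use a stale summary.
        _ => Plan::Fresh,
    }
}
```

If any message inside the covered prefix is edited or removed, the hash no longer matches and the sketch returns `Plan::Fresh`, which is the "no stale summary" property described above.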

You'll see a small compressed chip in the telemetry panel's Token breakdown card whenever the most recent assistant turn ran through the compressor; hover for the before → after token counts and how many prior turns got collapsed.

How the threshold check works

Each provider already reports the real prompt_tokens and completion_tokens on the SSE Done event at the end of every turn. The frontend caches those counts on the session and re-sends them as prev_turn_tokens on the next /chat. The compressor uses:

projected_input ≈ prev_prompt_tokens + prev_completion_tokens + estimate(new_user_message)

because everything in last turn's history is still in this turn's history, plus the last assistant reply (= prev_completion_tokens), plus whatever the user just typed. The system prompt's volatile suffix (date, top tags) drifts by under 10 tokens between turns and is well inside the 512-token budget buffer.
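The projection itself is a plain sum. A minimal sketch, assuming `PrevTurnTokens` carries the two counts under the field names shown here (the field names and the saturating arithmetic are assumptions, not taken from the diff):

```rust
/// Token counts reported on the previous turn's `Done` event.
/// Field names are assumed for illustration.
#[derive(Clone, Copy, Debug)]
pub struct PrevTurnTokens {
    pub prompt_tokens: u32,
    pub completion_tokens: u32,
}

/// projected_input ≈ prev_prompt_tokens + prev_completion_tokens
///                   + estimate(new_user_message)
pub fn projected_input(prev: PrevTurnTokens, new_user_message_tokens: u32) -> u32 {
    // Saturating adds keep a malformed client value from wrapping around.
    prev.prompt_tokens
        .saturating_add(prev.completion_tokens)
        .saturating_add(new_user_message_tokens)
}
```

For example, a previous turn with 3,900 prompt tokens and 350 completion tokens, followed by a 40-token user message, projects to 4,290 tokens. On an 8,192-token window that is above the 50% Fast threshold (4,096) and below the 60% Normal threshold (4,915).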

On the first turn of a session (no prior data) and for legacy clients that don't ship prev_turn_tokens, we fall back to a conservative character heuristic (len * 2 / 7 ≈ len / 3.5, i.e. roughly 3.5 characters per token). That's intentionally pessimistic so we err toward triggering compression a turn early rather than overflowing.
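A minimal sketch of that fallback estimate. The `len * 2 / 7` formula is from this description; the function name `estimate_tokens`, counting Unicode scalar values rather than bytes, and flooring via integer division are assumptions.

```rust
/// Character-based token estimate: len * 2 / 7, i.e. about 3.5 chars per token.
/// Counts Unicode scalar values (assumption; the real code may count bytes).
pub fn estimate_tokens(text: &str) -> usize {
    let len = text.chars().count();
    // Integer arithmetic avoids floats; saturating_mul guards absurd lengths.
    len.saturating_mul(2) / 7
}
```

Under these assumptions a 35-character message estimates to 10 tokens, and the estimate never decreases as text grows.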

Why this matters

  • Unblocks long chat threads on local 8–16k models without the user having to manually clear context or start a new thread.
  • Threshold accuracy is provider-grade, not heuristic-grade — the compressor uses the same token counts OpenRouter / Ollama / OpenAI-compatible servers report back. Tool-result heavy turns no longer get under-counted.
  • No regression for short sessions — the no-op fast paths (empty history, history ≤ 4 messages, total under threshold) bail before any LLM call.
  • Cheaper per-turn cost on extended sessions — cached summaries get extended with a short LLM call instead of being regenerated.
  • Plays nicely with the existing modes work (PR !1) — Fast triggers earliest because users want snappy responses; Deep Research triggers latest because it tolerates verbose context.
  • Spec + plan are committed alongside at docs/superpowers/specs/2026-05-18-history-compression-design.md and docs/superpowers/plans/2026-05-18-history-compression.md for future reference.

What's in the diff

Backend (src-tauri/):

  • New src/agent/history.rs (~450 lines) — Compressor, Summarizer trait + ProviderSummarizer, HistorySummary, CompressionInfo, PrevTurnTokens, blake3 canonical-prefix hash, strict-prefix gap detection.
  • Two new system prompts in agent/prompts.rs — one for fresh summarisation, one for extending a cached summary.
  • compression_threshold_pct added per-mode to ModeBudgets.
  • ChatEvent::Done now carries compression: Option<CompressionInfo>; SSE handler emits it conditionally (no \"compression\": null noise when absent).
  • create_chat_stream runs the compressor right after build_system_prompt; outcome plumbed onto Done via an Arc<Mutex<Option<_>>> slot.
  • ChatRequest accepts optional history_summary and prev_turn_tokens (both #[serde(default)] for back-compat).
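The `Arc<Mutex<Option<_>>>` slot mentioned above is a common way to hand a value from one stage of a stream to a later event. A minimal sketch of the pattern, not the actual plumbing in `create_chat_stream`: the `CompressionInfo` fields and the helper names `record` and `take_for_done` are hypothetical.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical shape; the description only says the chip shows
/// before/after token counts and how many turns were collapsed.
#[derive(Clone, Debug, PartialEq)]
pub struct CompressionInfo {
    pub tokens_before: u32,
    pub tokens_after: u32,
    pub turns_collapsed: usize,
}

/// Shared slot: written by the compression stage, read when `Done` is built.
pub type CompressionSlot = Arc<Mutex<Option<CompressionInfo>>>;

/// Called by the compression stage when it actually collapsed history.
pub fn record(slot: &CompressionSlot, info: CompressionInfo) {
    // A poisoned lock only means another thread panicked; still store the value.
    let mut guard = slot.lock().unwrap_or_else(|e| e.into_inner());
    *guard = Some(info);
}

/// Called once when building `Done`; `take` leaves `None` behind,
/// so the outcome is emitted at most once.
pub fn take_for_done(slot: &CompressionSlot) -> Option<CompressionInfo> {
    slot.lock().unwrap_or_else(|e| e.into_inner()).take()
}
```

When the compressor is a no-op the slot stays `None`, which lines up with omitting the `compression` field from the SSE payload instead of sending a null.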

Frontend (src/):

  • chat-client.ts — new HistorySummary, CompressionInfo, PrevTurnTokens types, request-body emission, message_complete forwarding.
  • useChat.tsx + mock-data.ts — session cache for historySummary and lastCompression; tokenUsage (already cached) is now also passed through as prevTurnTokens on every /chat.
  • telemetry.tsx — compressed chip with native tooltip in the Token breakdown card header.

Test plan

  • cargo test --lib agent::history — 14 tests: monotonic & conservative estimate, hash stability, hash differentiation, first-turn no-op, short-history no-op, over-threshold collapse with discriminating SUMMARIZATION-prompt assertion, cache reuse, cache extension with discriminating EXTEND-prompt assertion, cache discard, failure fallback, precise-token threshold trip, precise-token under-threshold no-op
  • cargo test --lib chat_request_ — 4 serde back-compat tests (with/without history_summary, with/without prev_turn_tokens)
  • cargo test --lib — 83 total backend tests passing
  • cargo clippy --lib -- -D warnings clean
  • bun run lint clean
  • bun run build clean (static export, no type errors)
  • Manual: send 10-ish turns at a local 8k model, watch the compressor trip, hover the chip, confirm a follow-up turn extends the cached summary rather than regenerating (left for you to drive against your live vault)
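To show what a "discriminating prompt assertion" and the failure-fallback test might look like, here is a simplified, synchronous sketch. The real `Summarizer` trait is likely async and its signature is not shown in this description, so the trait shape, the placeholder prompt constants and the `compress` helper are all assumptions; only `FakeSummarizer` and `last_system_prompt` are names taken from the commit log.

```rust
use std::cell::RefCell;

/// Simplified synchronous stand-in for the summariser trait (assumption).
pub trait Summarizer {
    fn summarize(&self, system_prompt: &str, input: &str) -> Result<String, String>;
}

/// Test double that records the system prompt it was last called with.
pub struct FakeSummarizer {
    pub reply: Result<String, String>,
    pub last_system_prompt: RefCell<Option<String>>,
}

impl Summarizer for FakeSummarizer {
    fn summarize(&self, system_prompt: &str, _input: &str) -> Result<String, String> {
        *self.last_system_prompt.borrow_mut() = Some(system_prompt.to_string());
        self.reply.clone()
    }
}

// Placeholder prompts; the real ones live in agent/prompts.rs.
pub const FRESH_PROMPT: &str = "Summarise the conversation.";
pub const EXTEND_PROMPT: &str = "Extend the existing summary.";

/// Returns the history to send: summary plus kept messages on success,
/// or the untouched original history if the summariser fails.
pub fn compress(
    summarizer: &dyn Summarizer,
    cached: Option<&str>,
    older: &[String],
    kept: &[String],
) -> Vec<String> {
    let (prompt, input) = match cached {
        Some(prev) => (EXTEND_PROMPT, format!("{prev}\n{}", older.join("\n"))),
        None => (FRESH_PROMPT, older.join("\n")),
    };
    match summarizer.summarize(prompt, &input) {
        Ok(summary) => std::iter::once(summary).chain(kept.iter().cloned()).collect(),
        Err(err) => {
            // Never block the turn: log and fall back to uncompressed history.
            eprintln!("summariser failed, using uncompressed history: {err}");
            older.iter().chain(kept.iter()).cloned().collect()
        }
    }
}
```

Asserting on `last_system_prompt` after each call is what makes the fresh and extend paths observably different, even when the fake returns the same reply for both.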

Known limitations (deferred to v1b, documented in spec)

  • Tool-result heavy turns inside the keep window stay verbatim — tool-result stubbing for the keep window is v1b.
  • Summariser uses the same chat model → GPU contention on single-GPU local setups. Future work could pin it to a smaller dedicated model.
  • Summary cache + tokenUsage live in frontend session state; app restart loses them (first turn after reload falls back to the heuristic until the next Done arrives).
  • Compressor.settings field is currently #[allow(dead_code)] — reserved for v1b per-vault overrides.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Capture last_system_prompt in FakeSummarizer and assert on it in
both the fresh-summarisation and cache-extension tests, so the two
paths produce different observable outcomes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The character heuristic (len/3.5) systematically under-counts on
tool-result heavy turns and ignores model-specific tokenisation.
The previous turn's prompt_tokens + completion_tokens is already
reported by every provider on Done — re-send it from the frontend
session cache and the compressor can project the next turn's
input as: prev_prompt + prev_completion + tokens(new_user_message).

Falls back to the character heuristic on the first turn of a
session (no prior data) and on legacy clients that don't ship
prev_turn_tokens. 4 new tests cover the precise path + serde
back-compat.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
hegdeatri merged commit 0f93abc281 into master 2026-05-18 21:50:37 +01:00
hegdeatri deleted branch feature/history-compression 2026-05-18 21:50:38 +01:00
Reference
hegdeatri/pkma-rs!4