KV CACHE COMPRESSION

Longer context. Less memory. Retrieval intact.

At 128K context, standard Q8 KV retrieved 1 of 5 needles. fraQtl used less memory than Q8 and retrieved all 5.

Mistral-7B · 128K context · llama.cpp · 5-needle retrieval

KV FORMAT            MEMORY    RETRIEVAL
fp16 KV (baseline)   22.7 GB   5 / 5
Q8 KV                15.4 GB   1 / 5
fraQtl D1            13.3 GB   5 / 5

Below Q8 memory, fp16-level retrieval — in a real long-context run. Standard low-bit KV saves memory but loses the needle; fraQtl saves memory and keeps it. See the full receipt →

What is the KV cache?

Every transformer-based language model uses an attention mechanism to decide which previous tokens matter for predicting the next one. To avoid recomputing those tokens' attention inputs at every step, models store two sets of vectors, Keys (K) and Values (V), for every token they've seen. This is the KV cache.

The KV cache is what gives a model its "memory" during a conversation. Without it, every generation step would recompute Keys and Values for the entire preceding context, so per-token cost would climb as the conversation grows. With it, generation is fast, but the cache grows linearly with every token.
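To make the mechanism concrete, here is a minimal single-head sketch in NumPy. The projection matrices are random and the names (`W_q`, `W_k`, `W_v`, `decode_step`) are illustrative assumptions, not taken from any particular model or runtime.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # head dimension (illustrative)
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    """Softmax attention of one query over all cached keys and values."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def decode_step(x, k_cache, v_cache):
    """Process one new token: append its K and V rows, then attend over the cache."""
    k_cache.append(W_k @ x)  # computed once, reused at every later step
    v_cache.append(W_v @ x)
    return attend(W_q @ x, np.stack(k_cache), np.stack(v_cache))

k_cache, v_cache = [], []
tokens = rng.standard_normal((5, d))
outputs = [decode_step(x, k_cache, v_cache) for x in tokens]

# The cache holds one K row and one V row per token seen so far.
print(len(k_cache), len(v_cache))  # prints "5 5"
```

Each step projects only the newest token, which is where the speed comes from. The two lists never shrink, which is where the memory cost comes from.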

Why KV cache is the bottleneck

For small contexts (a few hundred tokens), the KV cache is negligible. But long-context workloads run at 32K and 128K+. At those lengths, the cache dominates GPU memory:

CONTEXT LENGTH   KV CACHE SIZE (7B MODEL)   REALITY
4K tokens        0.5 GB                     Fits easily
32K tokens       4.3 GB                     Tight on one GPU
128K tokens      17 GB                      Exceeds model weights
512K tokens      68 GB                      Needs multiple GPUs just for cache

At 128K context, the cache is larger than the model itself. You're buying GPUs to store memory, not to compute.
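The sizes in the table can be roughly reproduced with simple arithmetic. The sketch below assumes a Mistral-7B-style configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128) with fp16 values. The page does not state how its figures were derived, so treat these defaults as an assumption that happens to match.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed for K and V across all layers at a given context length."""
    # The leading 2 accounts for storing both a Key and a Value vector.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token

for label, n_tokens in [("4K", 4 * 1024), ("32K", 32 * 1024),
                        ("128K", 128 * 1024), ("512K", 512 * 1024)]:
    print(f"{label:>5}: {kv_cache_bytes(n_tokens) / 1e9:.1f} GB")

# Prints:
#    4K: 0.5 GB
#   32K: 4.3 GB
#  128K: 17.2 GB
#  512K: 68.7 GB
```

Under these assumptions each token costs 128 KiB of cache. A model without grouped-query attention (for example 32 KV heads instead of 8) would need four times as much.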

For every dollar you spend on GPU memory for a 7B model at 32K context, roughly 40 cents goes to the KV cache. At 128K, it's 70 cents.

Why this matters for inference cost

GPU memory is the most expensive resource in LLM deployment. Every gigabyte of KV cache is a gigabyte that can't be used for batching more users, running longer conversations, or deploying larger models.

For production inference at scale (chatbots, coding assistants, document analysis, RAG pipelines), KV cache compression is one of the highest-leverage memory optimizations available.

Why existing approaches fall short

The field has tried several strategies to shrink the KV cache. Each has a fundamental tradeoff:

Token eviction (SnapKV, TOVA, StreamingLLM)

Drop tokens the model "probably" doesn't need. The problem: you don't know what the model will need next. Eviction is irreversible. If a dropped token turns out to be critical later, the model hallucinates or loses coherence. Needle-in-a-haystack recall drops to 40–93%.
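A toy illustration of why eviction is irreversible. The policy here is a simplified StreamingLLM-style rule (keep a few initial "sink" tokens plus a recent window). SnapKV and TOVA select tokens by attention score instead, which this sketch does not reproduce. The function name and sizes are illustrative assumptions.

```python
def evict(cache, n_sink=4, window=8):
    """Keep the first n_sink entries and the most recent `window` entries; drop the rest."""
    if len(cache) <= n_sink + window:
        return list(cache)
    return cache[:n_sink] + cache[-window:]

# Each cache entry is (position, payload); the "needle" sits at position 10.
cache = []
for pos in range(32):
    payload = "needle" if pos == 10 else "filler"
    cache.append((pos, payload))
    cache = evict(cache)

kept_positions = [pos for pos, _ in cache]
needle_kept = any(payload == "needle" for _, payload in cache)
print(kept_positions)  # [0, 1, 2, 3, 24, 25, 26, 27, 28, 29, 30, 31]
print(needle_kept)     # False
```

Once position 10 falls out of the window it is gone for good, and no later query can bring it back.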

Rank reduction (SVD, low-rank projection)

Delete entire dimensions from the cache. This destroys information permanently. Even removing the "least important" dimensions causes catastrophic attention routing errors: the model reads the wrong tokens. Perplexity (PPL) degrades by 4–364 points depending on aggressiveness.
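The mechanism can be sketched with a truncated SVD on a synthetic key matrix. This uses random data rather than real model activations, so it illustrates how dropping dimensions shifts attention weights. It does not reproduce the perplexity figures quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d = 64, 16
K = rng.standard_normal((n_tokens, d))  # synthetic cached keys
q = rng.standard_normal(d)              # synthetic query

def truncate_rank(M, r):
    """Best rank-r approximation of M via truncated SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

full = softmax(K @ q / np.sqrt(d))
for r in (16, 8, 4):
    approx = softmax(truncate_rank(K, r) @ q / np.sqrt(d))
    # Total variation distance between the two attention distributions.
    tv = 0.5 * np.abs(full - approx).sum()
    print(f"rank {r:2d}: attention shift (TV distance) = {tv:.3f}")
```

At full rank the attention distribution is unchanged. As the rank drops, probability mass moves onto different tokens, which is the "routing error" described above.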

Naive quantization (KIVI, round-to-nearest)

Reduce precision uniformly across all dimensions. Better than deletion, but without knowing which dimensions matter more, you spend precision budget on directions that don't contribute to prediction. PPL rises by 0.27 to 1.00.
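For reference, plain round-to-nearest quantization looks roughly like this. It is a generic symmetric per-row scheme on synthetic data, written as an assumption-laden sketch. KIVI's actual grouping and asymmetric scaling are not reproduced here.

```python
import numpy as np

def quantize_rtn(x, n_bits):
    """Symmetric per-row round-to-nearest quantization to signed n_bits integers."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid dividing by zero on all-zero rows
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate floating-point values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
V = rng.standard_normal((64, 16)).astype(np.float32)  # synthetic cached values
for n_bits in (8, 4, 2):
    q, scale = quantize_rtn(V, n_bits)
    err = np.abs(dequantize(q, scale) - V).mean()
    print(f"{n_bits}-bit: mean abs error = {err:.4f}")
```

Every dimension gets the same bit width regardless of how much it matters, which is the limitation this section describes.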

fraQtl: a different approach

fraQtl compresses the KV cache using a mathematically derived importance metric that estimates which directions carry downstream signal and which can be aggressively compressed.

The key insight: not all dimensions are equal. A small fraction of the cache carries most of the information the model actually uses for prediction. fraQtl calibrates an importance-aware basis and allocates precision toward the directions most likely to affect downstream behavior.
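The general idea of importance-aware precision allocation can be illustrated with a generic stand-in. The sketch below is not fraQtl's method, whose metric is not disclosed on this page. It assumes a PCA basis calibrated on synthetic data, uses variance as a placeholder importance score, and picks an arbitrary bit split with the same 4-bit average as the uniform baseline.

```python
import numpy as np

def quantize_columns(X, bits):
    """Round-to-nearest per column, with a different bit width per column."""
    out = np.empty_like(X)
    for j, b in enumerate(bits):
        qmax = 2 ** (int(b) - 1) - 1
        scale = np.abs(X[:, j]).max() / qmax
        if scale == 0:
            scale = 1.0
        out[:, j] = np.clip(np.round(X[:, j] / scale), -qmax, qmax) * scale
    return out

rng = np.random.default_rng(0)
n, d = 512, 16
# Synthetic cache whose energy is concentrated in a few hidden directions.
spectrum = np.geomspace(10.0, 0.1, d)
mix = np.linalg.qr(rng.standard_normal((d, d)))[0]
V = (rng.standard_normal((n, d)) * spectrum) @ mix.T

# Calibrate: PCA basis, with variance as a stand-in importance score.
eigvals, basis = np.linalg.eigh(np.cov(V, rowvar=False))
basis = basis[:, np.argsort(eigvals)[::-1]]

# Same average budget (4 bits per dimension), spent two ways.
uniform_bits = np.full(d, 4)
aware_bits = np.array([6] * 4 + [4] * 4 + [3] * 8)  # mean is 4.0

uniform = quantize_columns(V, uniform_bits)
aware = quantize_columns(V @ basis, aware_bits) @ basis.T

for name, rec in (("uniform", uniform), ("importance-aware", aware)):
    rel = np.linalg.norm(rec - V) / np.linalg.norm(V)
    print(f"{name}: relative error = {rel:.4f}")
```

On this synthetic data the non-uniform split reconstructs the cache with lower error at the same average bit width. Whether a variance-based score tracks downstream prediction quality on a real model is a separate question that this sketch does not answer.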

For pre-compressed artifacts, the model loads through standard Hugging Face workflows — no code changes. Runtime KV compression is available separately in early access.

Research backing

The headline above is the customer claim. Underneath it is the research that makes it work — V-cache compression results across architectures, and a retrieval bake-off against published methods. These are lab/research numbers, reported for transparency, not the front-page promise.

MODEL          V-CACHE COMPRESSION   PPL DELTA        NIAH (375 TRIALS)
Mistral 7B     3.5×                  +0.012           100%
Llama 3.2 3B   3.5×                  +0.014 ± 0.004   100%
Qwen 2.5 3B    3.5×                  +0.015           100%

V-cache compression across 3 models with 3-seed error bars. Early-run NIAH (375 trials, 2 models). The bake-off below shows the full C44 1080-trial retrieval comparison against competing methods.

vs. published methods — NIAH bake-off

Needle-in-a-haystack retrieval on Mistral-7B-Instruct, 8K–31K context, 3 needle types × 360 trials per method (1080 trials total per row, except H2O which OOMs at 16K+).

METHOD          NIAH RETRIEVAL       PPL DELTA (V-ONLY)
fraQtl V-only   98.5%                +0.012
fraQtl V+K      97.8%                -
TOVA            97.8%                +0.259
FP16 baseline   97.0%                0
SnapKV          94.1%                +0.214
StreamingLLM    37.8%                +0.548
H2O             0.0% (OOM at 16K+)   -
KVQuant         -                    +0.27
KIVI            -                    +1.00
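For readers unfamiliar with needle-in-a-haystack testing, the shape of such a harness is roughly as follows. This is a generic sketch with stand-in "models". The page does not describe the actual bake-off harness, needle types, or scoring, and all names here are hypothetical.

```python
import random

def build_haystack(needle, n_filler, depth, filler="The sky was clear that day."):
    """Insert `needle` among `n_filler` filler sentences at a relative depth in [0, 1]."""
    sentences = [filler] * n_filler
    sentences.insert(round(depth * n_filler), needle)
    return " ".join(sentences)

def retrieval_rate(answer_fn, n_trials=20, n_filler=200, seed=0):
    """Fraction of trials where the model's answer contains the hidden code."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        code = str(rng.randint(100000, 999999))
        needle = f"The secret code is {code}."
        prompt = build_haystack(needle, n_filler, depth=rng.random())
        answer = answer_fn(prompt + " What is the secret code?")
        hits += code in answer
    return hits / n_trials

# Two stand-in "models": one sees the whole prompt, one only the last 500 characters.
full_context = lambda prompt: prompt
truncated = lambda prompt: prompt[-500:]

print(retrieval_rate(full_context))  # 1.0
print(retrieval_rate(truncated))
```

A real harness would replace the stand-ins with calls to the model under each cache configuration and sweep context length and needle depth.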

How to use it

Compressed artifacts. Published fraQtl-compressed models load through the standard Hugging Face / Transformers workflow — no custom install. See the current public artifacts on Hugging Face.

Runtime KV compression. Available in early access. We calibrate on your workload and benchmark against your own FP16 baseline before anything ships — bring your model and we'll confirm support.

Long-context memory without losing the needle.

Below Q8 memory, fp16-level retrieval — built for the failure mode where uniform low-bit KV loses recall. Bring your model; we benchmark against your stack.

Research

fraQtl is backed by two preprints (peer review in progress), validated across multiple transformer families with public receipts reported per model. The mathematical foundations are derived from first principles — not heuristics.

fraqtl.ai · contact@fraqtl.ai