KV Cache Compression for Large Language Models

The #1 memory bottleneck in LLM inference — and how to fix it.

What is the KV cache?

Every transformer-based language model uses an attention mechanism to decide which previous tokens matter for predicting the next one. To avoid recomputing attention from scratch at every step, models store two matrices — Keys (K) and Values (V) — for every token they've seen. This is the KV cache.

The KV cache is what gives a model its "memory" during a conversation. Without it, every step would recompute attention over the entire context, and generation would slow down quadratically with length. With it, generation is fast — but the cache grows linearly with every token.
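
To make this concrete, here is a minimal sketch of the cache being built during prefill and reused at a decode step with HuggingFace transformers (the model name is just an example, not a requirement):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tok("The KV cache stores", return_tensors="pt")
out = model(**inputs, use_cache=True)        # prefill: builds K and V for every prompt token
past = out.past_key_values                   # the KV cache: one (K, V) pair per layer
next_id = out.logits[:, -1].argmax(-1, keepdim=True)
out = model(input_ids=next_id, past_key_values=past, use_cache=True)  # decode reuses the cache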

Why KV cache is the bottleneck

For small contexts (a few hundred tokens), the KV cache is negligible. But modern models run at 32K, 128K, even 1M tokens. At those lengths, the cache dominates GPU memory:

CONTEXT LENGTH    KV CACHE SIZE (7B MODEL)    REALITY
4K tokens         0.5 GB                      Fits easily
32K tokens        4.3 GB                      Tight on one GPU
128K tokens       17 GB                       Exceeds model weights
512K tokens       68 GB                       Needs multiple GPUs just for cache

At 128K context, the cache is larger than the model itself. You're buying GPUs to store memory, not to compute.
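
These numbers follow from a simple formula. A quick sketch, assuming a Mistral-style 7B config (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16):

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store n_kv_heads * head_dim values per token, per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

for ctx in (4_096, 32_768, 131_072, 524_288):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 1e9:.1f} GB")   # 0.5, 4.3, 17.2, 68.7 GB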

For every dollar you spend on GPU memory for a 7B model at 32K context, roughly 40 cents goes to the KV cache. At 128K, it's 70 cents.

Why this matters for inference cost

GPU memory is the most expensive resource in LLM deployment. Every gigabyte of KV cache is a gigabyte that can't be used for batching more users, running longer conversations, or deploying larger models.

For production inference at scale — chatbots, coding assistants, document analysis, RAG pipelines — KV cache compression is the single highest-leverage memory optimization available today.

Why existing approaches fall short

The field has tried several strategies to shrink the KV cache. Each has a fundamental tradeoff:

Token eviction (SnapKV, TOVA, StreamingLLM)

Drop tokens the model "probably" doesn't need. The problem: you don't know what the model will need next. Eviction is irreversible. If a dropped token turns out to be critical later, the model hallucinates or loses coherence. Needle-in-a-haystack recall drops to 40–93%.
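
A toy sketch of the mechanic (the scoring rule here is a placeholder; SnapKV and TOVA derive it from attention statistics):

import torch

K = torch.randn(8, 1024, 128)                    # (kv_heads, seq_len, head_dim)
V = torch.randn(8, 1024, 128)
scores = torch.rand(1024)                        # stand-in for per-token attention mass
kept = scores.topk(256).indices.sort().values    # keep 256 of 1024 tokens, in original order
K_small, V_small = K[:, kept], V[:, kept]        # the other 768 tokens are gone for good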

Rank reduction (SVD, low-rank projection)

Delete entire dimensions from the cache. This destroys information permanently. Even removing the "least important" dimensions causes catastrophic attention routing errors — the model reads the wrong tokens. PPL degrades by 4–364 points depending on aggressiveness.
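
A toy sketch of the idea, projecting the cached values onto their top singular directions:

import torch

V = torch.randn(1024, 128)                       # cached value vectors: (tokens, head_dim)
U, S, Vh = torch.linalg.svd(V, full_matrices=False)
r = 32                                           # keep only the top-32 directions
V_lowrank = (U[:, :r] * S[:r]) @ Vh[:r]          # everything outside this subspace is lost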

Naive quantization (KIVI, round-to-nearest)

Reduce precision uniformly across all dimensions. Better than deletion, but without understanding which dimensions matter more, you waste precision budget on directions that don't contribute to prediction. PPL degrades by +0.27 to +1.00.
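
A toy sketch of uniform round-to-nearest quantization with a single per-tensor scale; every dimension gets the same treatment, important or not:

import torch

def quantize_rtn(x, bits=3):
    scale = x.abs().max() / (2 ** (bits - 1) - 1)                   # one scale for the whole tensor
    q = torch.round(x / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale                                                # same error budget everywhere

V_q = quantize_rtn(torch.randn(1024, 128), bits=3)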

fraQtl: a different approach

fraQtl compresses the KV cache using a mathematically derived importance metric that identifies exactly which cache dimensions carry predictive signal and which can be aggressively compressed.

The key insight: not all dimensions are equal. A small fraction of the cache carries almost all the information the model actually uses for prediction. fraQtl finds these dimensions automatically — in under a second — and allocates precision accordingly.
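
fraQtl's actual metric is not reproduced here. Purely to illustrate the shape of the idea, the sketch below scores value dimensions with a placeholder statistic (variance), keeps the top-k dimensions in full precision, and round-to-nearest quantizes the rest:

import torch

def compress_values(V, k=16, bits=3):
    importance = V.float().var(dim=0)                # placeholder metric, NOT fraQtl's
    keep = importance.topk(k).indices                # the few dimensions that get full precision
    scale = V.abs().max() / (2 ** (bits - 1) - 1)
    V_q = torch.round(V / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    V_q[:, keep] = V[:, keep]                        # spend the precision budget where it matters
    return V_q

V_compressed = compress_values(torch.randn(1024, 128))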

The result is a compressed model that is architecturally identical to the original. No hooks, no custom kernels, no code changes. Load it with standard HuggingFace transformers and serve normally.

Results

MODEL           V-CACHE COMPRESSION    PPL DELTA         NIAH (375 TRIALS)    OVERHEAD
Mistral 7B      3.5×                   +0.012            100%                 zero
Llama 3.2 3B    3.5×                   +0.014 ± 0.004    100%                 zero
Qwen 2.5 3B     3.5×                   +0.015            100%                 zero

V-cache compression across three models, with 3-seed error bars. The NIAH column is from an early run (375 trials, 2 models); the bake-off below shows the full 1080-trial retrieval comparison (run C44) against competing methods.

vs. published methods — NIAH bake-off

Needle-in-a-haystack retrieval on Mistral-7B-Instruct, 8K–31K context, 3 needle types × 360 trials each (1080 trials per method, except H2O, which OOMs at 16K+).

METHOD           NIAH RETRIEVAL        PPL DELTA (V-ONLY)    OVERHEAD
fraQtl V-only    98.5%                 +0.012                zero
fraQtl V+K       97.8%                 n/a                   zero
TOVA             97.8%                 +0.259                per-token
FP16 baseline    97.0%                 0                     n/a
SnapKV           94.1%                 +0.214                per-token
StreamingLLM     37.8%                 +0.548                per-token
H2O              0.0% (OOM at 16K+)    n/a                   eager attn
KVQuant          n/a                   +0.27                 custom kernels
KIVI             n/a                   +1.00                 per-token

Downstream task accuracy

Compression doesn't just preserve perplexity — it preserves real task performance:

TASK             FP16 BASELINE    FRAQTL    DELTA
SQuAD v2 (QA)    58.5%            60.4%     +1.9%
TriviaQA (QA)    28.6%            31.3%     +2.7%
CNN/DailyMail    21.1%            20.6%     -0.5%
XSum             20.1%            23.1%     +3.0%

3 out of 4 tasks improved; the average delta across all four is +1.8 points.

How to use it

Option 1: Pre-compressed models (free)

Download from HuggingFace. Load with standard transformers. No fraQtl install needed.

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("fraQtl/Mistral-7B-compressed")
# That's it. 3.5× less KV memory. Same API.

Option 2: Runtime compression (early access)

Compress any HuggingFace model at inference time. One line. No retraining.

import fraqtl
fraqtl.enable_cache_compression(model, k=16, bits=3)
# Your model now uses 3.5× less V-cache memory.

Compress what matters.

3.5× V-cache compression. Retrieval preserved: 98.5% on 1080 long-context trials. Training-free. One line of code.

Get Early Access

Research

fraQtl is backed by two peer-reviewed preprints with validated results across 7 models (3B–70B parameters). The mathematical foundations are derived from first principles — not heuristics.

Patent pending · fraqtl.ai · contact@fraqtl.ai