KV Cache Compression for Large Language Models

The #1 memory bottleneck in LLM inference — and how to fix it.

What is the KV cache?

Every transformer-based language model uses an attention mechanism to decide which previous tokens matter for predicting the next one. To avoid recomputing attention from scratch at every step, models store two matrices — Keys (K) and Values (V) — for every token they've seen. This is the KV cache.

The KV cache is what gives a model its "memory" during a conversation. Without it, every step would recompute attention over the entire context, and generation would slow down quadratically with length. With it, generation is fast — but the cache grows linearly with every token.
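
To make this concrete, here is a minimal sketch of the cache being built during prefill and reused at a decode step with HuggingFace transformers (the model name is just an example, not a requirement):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tok("The KV cache stores", return_tensors="pt")
out = model(**inputs, use_cache=True)        # prefill: builds K and V for every prompt token
past = out.past_key_values                   # the KV cache: one (K, V) pair per layer
next_id = out.logits[:, -1].argmax(-1, keepdim=True)
out = model(input_ids=next_id, past_key_values=past, use_cache=True)  # decode reuses the cache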

Why KV cache is the bottleneck

For small contexts (a few hundred tokens), the KV cache is negligible. But modern models run at 32K, 128K, even 1M tokens. At those lengths, the cache dominates GPU memory:

CONTEXT LENGTH    KV CACHE SIZE (7B MODEL)    REALITY
4K tokens         0.5 GB                      Fits easily
32K tokens        4.3 GB                      Tight on one GPU
128K tokens       17 GB                       Exceeds model weights
512K tokens       68 GB                       Needs multiple GPUs just for cache

At 128K context, the cache is larger than the model itself. You're buying GPUs to store memory, not to compute.
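
These numbers follow from a simple formula. A quick sketch, assuming a Mistral-style 7B config (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16):

def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V each store n_kv_heads * head_dim values per token, per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

for ctx in (4_096, 32_768, 131_072, 524_288):
    print(f"{ctx:>7} tokens -> {kv_cache_bytes(ctx) / 1e9:.1f} GB")   # 0.5, 4.3, 17.2, 68.7 GB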

For every dollar you spend on GPU memory for a 7B model at 32K context, roughly 40 cents goes to the KV cache. At 128K, it's 70 cents.

Why this matters for inference cost

GPU memory is the most expensive resource in LLM deployment. Every gigabyte of KV cache is a gigabyte that can't be used for batching more users, running longer conversations, or deploying larger models.

For production inference at scale — chatbots, coding assistants, document analysis, RAG pipelines — KV cache compression is the single highest-leverage memory optimization available today.

Why existing approaches fall short

The field has tried several strategies to shrink the KV cache. Each has a fundamental tradeoff:

Token eviction (SnapKV, TOVA, StreamingLLM)

Drop tokens the model "probably" doesn't need. The problem: you don't know what the model will need next. Eviction is irreversible. If a dropped token turns out to be critical later, the model hallucinates or loses coherence. Needle-in-a-haystack recall drops to 40–93%.
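
A toy sketch of the mechanic (the scoring rule here is a placeholder; SnapKV and TOVA derive it from attention statistics):

import torch

K = torch.randn(8, 1024, 128)                    # (kv_heads, seq_len, head_dim)
V = torch.randn(8, 1024, 128)
scores = torch.rand(1024)                        # stand-in for per-token attention mass
kept = scores.topk(256).indices.sort().values    # keep 256 of 1024 tokens, in original order
K_small, V_small = K[:, kept], V[:, kept]        # the other 768 tokens are gone for good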

Rank reduction (SVD, low-rank projection)

Delete entire dimensions from the cache. This destroys information permanently. Even removing the "least important" dimensions causes catastrophic attention routing errors — the model reads the wrong tokens. PPL degrades by 4–364 points depending on aggressiveness.
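
A toy sketch of the idea, projecting the cached values onto their top singular directions:

import torch

V = torch.randn(1024, 128)                       # cached value vectors: (tokens, head_dim)
U, S, Vh = torch.linalg.svd(V, full_matrices=False)
r = 32                                           # keep only the top-32 directions
V_lowrank = (U[:, :r] * S[:r]) @ Vh[:r]          # everything outside this subspace is lost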

Naive quantization (KIVI, round-to-nearest)

Reduce precision uniformly across all dimensions. Better than deletion, but without understanding which dimensions matter more, you waste precision budget on directions that don't contribute to prediction. PPL degrades by +0.27 to +1.00.
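
A toy sketch of uniform round-to-nearest quantization with a single per-tensor scale; every dimension gets the same treatment, important or not:

import torch

def quantize_rtn(x, bits=3):
    scale = x.abs().max() / (2 ** (bits - 1) - 1)                   # one scale for the whole tensor
    q = torch.round(x / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale                                                # same error budget everywhere

V_q = quantize_rtn(torch.randn(1024, 128), bits=3)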

fraQtl: a different approach

fraQtl compresses the KV cache using a mathematically derived importance metric that identifies exactly which cache dimensions carry predictive signal and which can be aggressively compressed.

The key insight: not all dimensions are equal. A small fraction of the cache carries almost all the information the model actually uses for prediction. fraQtl finds these dimensions automatically — in under a second — and allocates precision accordingly.
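
fraQtl's actual metric is not reproduced here. Purely to illustrate the shape of the idea, the sketch below scores value dimensions with a placeholder statistic (variance), keeps the top-k dimensions in full precision, and round-to-nearest quantizes the rest:

import torch

def compress_values(V, k=16, bits=3):
    importance = V.float().var(dim=0)                # placeholder metric, NOT fraQtl's
    keep = importance.topk(k).indices                # the few dimensions that get full precision
    scale = V.abs().max() / (2 ** (bits - 1) - 1)
    V_q = torch.round(V / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale
    V_q[:, keep] = V[:, keep]                        # spend the precision budget where it matters
    return V_q

V_compressed = compress_values(torch.randn(1024, 128))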

The result is a compressed model that is architecturally identical to the original. No hooks, no custom kernels, no code changes. Load it with standard HuggingFace transformers and serve normally.

Results

MODEL           V-CACHE COMPRESSION    PPL DELTA         NIAH (375 TRIALS)    OVERHEAD
Mistral 7B      3.5×                   +0.012            100%                 zero
Llama 3.2 3B    3.5×                   +0.014 ± 0.004    100%                 zero
Qwen 2.5 3B     3.5×                   +0.015            100%                 zero

V-cache compression across three models, with 3-seed error bars. The NIAH column is from an early run (375 trials, 2 models); the bake-off below shows the full 1080-trial retrieval comparison (run C44) against competing methods.

vs. published methods — NIAH bake-off

Needle-in-a-haystack retrieval on Mistral-7B-Instruct, 8K–31K context, 3 needle types × 360 trials each (1080 trials per method, except H2O, which OOMs at 16K+).

METHOD           NIAH RETRIEVAL        PPL DELTA (V-ONLY)    OVERHEAD
fraQtl V-only    98.5%                 +0.012                zero
fraQtl V+K       97.8%                 n/a                   zero
TOVA             97.8%                 +0.259                per-token
FP16 baseline    97.0%                 0                     n/a
SnapKV           94.1%                 +0.214                per-token
StreamingLLM     37.8%                 +0.548                per-token
H2O              0.0% (OOM at 16K+)    n/a                   eager attn
KVQuant          n/a                   +0.27                 custom kernels
KIVI             n/a                   +1.00                 per-token

Downstream task accuracy

Compression doesn't just preserve perplexity — it preserves real task performance:

TASK             FP16 BASELINE    FRAQTL    DELTA
SQuAD v2 (QA)    58.5%            60.4%     +1.9%
TriviaQA (QA)    28.6%            31.3%     +2.7%
CNN/DailyMail    21.1%            20.6%     -0.5%
XSum             20.1%            23.1%     +3.0%

3 out of 4 tasks improved; the average delta across all four is +1.8 points.

How to use it

Option 1: Pre-compressed models (free)

Download from HuggingFace. Load with standard transformers. No fraQtl install needed.

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("fraQtl/Mistral-7B-compressed")
# That's it. 3.5× less KV memory. Same API.

Option 2: Runtime compression (early access)

Compress any HuggingFace model at inference time. One line. No retraining.

import fraqtl
fraqtl.enable_cache_compression(model, k=16, bits=3)
# Your model now uses 3.5× less V-cache memory.

Compress what matters.

3.5× V-cache compression. Retrieval preserved: 98.5% on 1080 long-context trials. Training-free. One line of code.

Get Early Access

Research

fraQtl is backed by two peer-reviewed preprints with validated results across 7 models (3B–70B parameters). The mathematical foundations are derived from first principles — not heuristics.

Patent pending · fraqtl.ai · contact@fraqtl.ai