Most KV-cache methods compress by deletion — removing directions, losing signal.
fraQtl compresses by precision: every dimension survives, attention routing is preserved.
Proven across five transformer models at matched storage budgets.
Four findings that change how you compress transformers.
At matched storage budgets, quantization consistently outperforms rank reduction. The gap is not about basis — it is about the geometry of softmax routing.
01
Quantization beats rank reduction everywhere
In every model and budget tested, INT4 outperforms rank reduction at matched bit cost by 4 to 364 PPL. The margin grows with GQA aggressiveness.
02
Deletion causes discrete routing failures
Rank reduction flips 4.6% of attention routing decisions vs 0.03% for INT4. Bounded noise preserves score ordering. Deletion does not.
03
The basis doesn't matter — the paradigm does
Quantization quality is basis-independent (spread $<0.4$ PPL across all rotations). The advantage is about preserving all dimensions.
04
Joint $K$+$V$ INT4 at 75% reduction costs +0.18 PPL
Both $K$ and $V$ are safely quantizable. Per-channel symmetric INT4 requires no retraining, no special basis, no optimization.
MECHANISM
Why deletion fails where noise succeeds.
Each cell is a KV-cache dimension. Watch what happens to attention routing under each compression strategy.
Rank throws away signal. Quantization preserves it.
Rank Reduction — dimensions deleted
PPL: 9.19 (FP16)
Quantization — precision reduced
PPL: 9.19 (FP16)
THE FIGURE
Perplexity vs bits/dimension on Mistral 7B.
Rank reduction explodes below 4 bits. Uniform quantization collapses below 3 bits. fraQtl stays flat to 2 bits — the dead zone is where every other method fails.
When rank reduction deletes a KV direction it creates a score perturbation $|\delta| \approx \sigma_{\text{removed}}$. If this exceeds the gap $\Delta = s_{i_1} - s_{i_2}$, attention flips to the wrong token. Quantization keeps $|\delta|$ bounded at $\frac{\sigma}{2^b}$, for expected damage $3 \times 2^{2b}$ times smaller — $768\times$ at INT4.
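The routing-flip mechanism can be reproduced on synthetic data. The sketch below is illustrative, not the paper's experimental setup: deletion is modeled as zeroing the coordinate that contributes most to the winning score (a stand-in for removing one direction), and INT4 as rounding all keys to a shared symmetric grid.

```python
import numpy as np

def flip_rates(trials: int = 300, d: int = 64, n: int = 16) -> tuple[float, float]:
    """Fraction of synthetic trials where the attention argmax flips per scheme."""
    rng = np.random.default_rng(0)
    rank_flips = quant_flips = 0
    for _ in range(trials):
        q_vec = rng.normal(size=d)
        keys = rng.normal(size=(n, d))
        top = int(np.argmax(keys @ q_vec))

        # Deletion: zero the coordinate contributing most to the winning
        # score -- an unbounded perturbation relative to the score gap.
        j = int(np.argmax(np.abs(keys[top] * q_vec)))
        keys_del = keys.copy()
        keys_del[:, j] = 0.0
        rank_flips += int(int(np.argmax(keys_del @ q_vec)) != top)

        # INT4: round keys to a symmetric grid -- noise <= half a step per entry.
        scale = np.abs(keys).max() / 7.0
        keys_q = np.clip(np.round(keys / scale), -8, 7) * scale
        quant_flips += int(int(np.argmax(keys_q @ q_vec)) != top)
    return rank_flips / trials, quant_flips / trials
```

On isotropic synthetic data the deletion flip rate comes out far above the rounding flip rate; the exact numbers depend on the number of keys and the head dimension, not on any tuned constant.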
FRAQTL INT4
94%
Routing Stability ✅
Attention flow preserved — perturbation bounded at $\sigma/2^b$
RANK REDUCTION k=32
61%
Routing Stability ❌
High attention drift — deleted directions create unbounded perturbation
Rank Reduction — direction deleted
FP16 baseline — Token A routes
Quantization INT4 — precision reduced
FP16 baseline — Token A routes
QUANTIZATION VISUALIZED
Bounded noise vs information loss.
Every KV value survives quantization — just rounded to the nearest step. Rank reduction eliminates entire directions. Watch how precision degrades gracefully while deletion destroys structure.
FP16 original signal
INT4 quantized (all dims)
Rank-reduced (dims deleted)
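A minimal matched-budget version of this comparison, on a synthetic near-isotropic signal: 64 dims at INT4 and 16 FP16 dims via SVD truncation both cost 256 bits per token. The setup is illustrative (our assumptions), not the paper's benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64)).astype(np.float32)   # tokens x head_dim (synthetic)

# INT4, all dims kept: 64 dims x 4 bits = 256 bits/token.
scale = np.abs(X).max() / 7.0
X_int4 = np.clip(np.round(X / scale), -8, 7) * scale

# Rank-16 at FP16: 16 dims x 16 bits = 256 bits/token (matched budget).
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_rank = (U[:, :16] * S[:16]) @ Vt[:16]

mse_int4 = float(np.mean((X - X_int4) ** 2))
mse_rank = float(np.mean((X - X_rank) ** 2))
```

When the signal has no low-rank structure to exploit, truncation discards most of the variance while rounding error stays at a fraction of a quantization step — which is the "graceful vs destructive" contrast the animation shows.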
INTERACTIVE
Pick your memory budget. See who wins.
At every storage constraint, fraQtl outperforms rank reduction. Drag the slider — the gap only widens.
Memory budget: 3.6 bits/dim
MATCHED-BUDGET COMPARISON
Quantization vs rank reduction across all models.
Every row is a real experimental result at matched storage. Filter by model or method. Sort any column.
Columns (sortable): MODEL · ARCH · BUDGET · METHOD · DIMS · PPL · vs FP16 · MARGIN
THEORETICAL RESULT
A perturbation asymmetry formalizes the gap.
Under the softmax Fisher metric, projection damage exceeds quantization damage by $3 \times 2^{2b}$ per direction — $768\times$ at INT4.
PROPOSITION 2 — PERTURBATION ASYMMETRY
For direction $u$ with signal $\sigma_u$ under $G = \mathrm{diag}(\alpha) - \alpha\alpha^\top$:
$$\mathrm{KL}_{\mathrm{proj}} = \tfrac{1}{2}\,\sigma_u^2 \cdot u^\top G u \qquad \text{vs} \qquad \mathbb{E}[\mathrm{KL}_{\mathrm{quant}}] = \frac{\sigma_u^2}{2 \cdot 3 \cdot 2^{2b}} \cdot u^\top G u$$
$$\text{Ratio:}\quad 3 \times 2^{2b} \quad (768\times \text{ at INT4})$$
The sensitivity $u^\top G u$ cancels — both methods face the same softmax geometry.
The difference is entirely in perturbation magnitude: rank reduction deletes the direction's contribution
to the score gap $\Delta = s_{i_1} - s_{i_2}$; if $|\delta_i| > |\Delta|$, attention flips — a discrete failure.
Quantization perturbs by $\mathcal{O}(\sigma_i / 2^b)$, crossing the boundary only when noise exceeds $\Delta$,
which is rare at $b \geq 4$.
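Proposition 2's constant is easy to sanity-check numerically: deleting a direction removes its full variance $\sigma_u^2$, while rounding with step $2\sigma_u/2^b$ injects uniform noise of variance $\sigma_u^2/(3 \cdot 2^{2b})$. A Monte Carlo check of the ratio (our toy setup, not the paper's code):

```python
import numpy as np

b, sigma = 4, 1.0
step = 2 * sigma / 2**b                      # quantizer step over [-sigma, sigma]
rng = np.random.default_rng(0)
noise = rng.uniform(-step / 2, step / 2, size=1_000_000)

# Projection damage scales with sigma^2; quantization damage with the
# noise variance step^2 / 12 = sigma^2 / (3 * 2^(2b)).
ratio = sigma**2 / noise.var()
assert abs(ratio - 3 * 2 ** (2 * b)) / (3 * 2 ** (2 * b)) < 0.01   # ~768x at b=4
```

The Fisher factor $u^\top G u$ multiplies both damages identically, which is why it never appears in the ratio.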
THE RESEARCH
Independent ML research from FraQtl AI.
This work emerged from a systematic attempt to improve rank reduction — and the discovery
that the paradigm itself was the barrier. After exhausting every closed-form metric, perturbation
series, and learned correction, the breakthrough came from changing the compression operator entirely.