Production compression with a closed-form basis. The pullback metric over attention values picks the directions that downstream loss depends on; everything else is quantized with bounded distortion. Empirically validated across multiple transformer architectures. Patent pending.
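A minimal sketch of the idea, assuming a diagonal softmax Fisher weight $p(1-p)$ as the pullback metric; the exact metric and operator fraQtl uses are not reproduced here, and every name below is illustrative:

```python
import numpy as np

def fisher_weight(attn_probs):
    """Diagonal of the softmax Fisher metric: p * (1 - p) per position.
    Illustrative stand-in for the pullback metric described above."""
    return attn_probs * (1.0 - attn_probs)

def rank_v_directions(V, attn_probs):
    """Score each V-cache dimension by pullback-weighted variance,
    most loss-relevant first."""
    w = fisher_weight(attn_probs)              # (seq_len,)
    weighted = V * w[:, None]                  # weight rows by routing sensitivity
    return np.argsort(weighted.var(axis=0))[::-1]

def quantize(x, bits=4):
    """Uniform quantization: every value survives, rounded to the nearest step."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (2**bits - 1)
    return np.round((x - lo) / step) * step + lo
```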
Every number measured at inference time. 3-seed validation on Mistral 7B (GQA-4) and Qwen 2.5 3B (GQA-2); additional architectures scout-tested (Phi-3 MHA, Llama 3.2 3B GQA-3, Llama 3.1 8B GQA-8, Qwen3 30B-A3B). Same V-cache primitive across the reported architecture families, with validation reported per model. Compressed artifacts load through standard Hugging Face workflows.
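Loading is the standard `from_pretrained` call; the repo id below is a placeholder for illustration, not a published artifact:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "example-org/Mistral-7B-fraqtl-int4"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```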
| TASK | FP16 | FRAQTL | DELTA |
|---|---|---|---|
| SQuAD v2 (QA) | 58.5% | 60.4% | +1.9% |
| TriviaQA (QA) | 28.6% | 31.3% | +2.7% |
| CNN/DailyMail (Summarization) | 21.1% | 20.6% | -0.5% |
| XSum (Summarization) | 20.1% | 23.1% | +3.0% |
At matched storage budgets, quantization consistently outperforms rank reduction. The gap is not about the choice of basis; it is about the geometry of softmax routing.
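For reference, "matched storage" here uses the usual accounting: a rank-$r$ factorization keeps $r$ FP16 dimensions per token, while $b$-bit quantization keeps all $d$ dimensions, so budgets match when $16r = bd$. A sketch, ignoring basis and scale overheads:

```python
def matched_rank(d_head: int, bits: int, fp_bits: int = 16) -> int:
    """Rank whose FP16 storage matches a `bits`-bit quantized cache:
    fp_bits * r == bits * d_head (basis/scale overheads ignored)."""
    return max(1, (bits * d_head) // fp_bits)

assert matched_rank(128, 4) == 32   # a 128-dim head at INT4 matches rank 32
assert matched_rank(128, 2) == 16   # and rank 16 at 2 bits
```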
Each cell is a KV-cache dimension. Watch what happens to attention routing under each compression strategy.
Rank throws away signal.
Quantization preserves it.
Rank reduction explodes below an effective 4 bits. Uniform quantization collapses below 3 bits. fraQtl stays flat down to 2 bits, straight through the dead zone where every other method fails.
When rank reduction deletes a KV direction it creates a score perturbation $|\delta| \approx \sigma_{\text{removed}}$. If this exceeds the top-2 score gap $\Delta = s_{i_1} - s_{i_2}$, attention flips to the wrong token. Quantization keeps $|\delta|$ bounded at $\frac{\sigma}{2^b}$: $2^b = 16\times$ smaller in magnitude at INT4, and $3 \cdot 2^{2b} = 768\times$ smaller in expected squared damage (see below).
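A toy numpy experiment makes the flip argument concrete, comparing INT4 against rank reduction at matched storage (rank 16 for a 64-dim head). Synthetic Gaussian scores, not the reported benchmarks:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, bits = 256, 64, 4
K = rng.standard_normal((n, d))
q = rng.standard_normal(d)
scores = K @ q
gap = np.diff(np.sort(scores)[-2:])[0]      # top-2 margin Δ

# Rank reduction at matched storage: keep only r = bits*d/16 directions.
r = bits * d // 16                          # rank 16 at INT4
U, S, Vt = np.linalg.svd(K, full_matrices=False)
K_rank = (U[:, :r] * S[:r]) @ Vt[:r]
delta_rank = np.abs((K_rank - K) @ q).max()

# Quantization: every entry survives, rounded onto a 4-bit grid.
lo, hi = K.min(), K.max()
step = (hi - lo) / (2**bits - 1)
K_q = np.round((K - lo) / step) * step + lo
delta_q = np.abs((K_q - K) @ q).max()

print(f"gap Δ={gap:.2f}  rank max|δ|={delta_rank:.2f}  quant max|δ|={delta_q:.2f}")
```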
Every KV value survives quantization — just rounded to the nearest step. Rank reduction eliminates entire directions. Watch how precision degrades gracefully while deletion destroys structure.
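A sketch of that contrast on synthetic data, reusing the matched-storage rule above: worst-case elementwise error under quantization stays at half a step, while rank truncation carries no per-value guarantee at all.

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.standard_normal((512, 128))

def quant_max_err(x, bits):
    step = (x.max() - x.min()) / (2**bits - 1)
    xq = np.round((x - x.min()) / step) * step + x.min()
    return np.abs(xq - x).max()            # at most step / 2, per value

def rank_max_err(x, r):
    U, S, Vt = np.linalg.svd(x, full_matrices=False)
    return np.abs((U[:, :r] * S[:r]) @ Vt[:r] - x).max()  # whole directions gone

for bits in (8, 4, 2):
    r = bits * V.shape[1] // 16            # matched storage
    print(f"{bits}-bit: quant {quant_max_err(V, bits):.3f} "
          f"vs rank-{r} {rank_max_err(V, r):.3f}")
```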
At every storage constraint, fraQtl outperforms rank reduction. Drag the slider — the gap only widens.
Every row is a real experimental result at matched storage. Filter by model or method. Sort any column.
| MODEL | ARCH | BUDGET | METHOD | DIMS | PPL | vs FP16 | MARGIN |
|---|---|---|---|---|---|---|---|
Under the softmax Fisher metric, projection damage exceeds quantization damage by $3 \times 2^{2b}$ per direction — $768\times$ at INT4.
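The ratio follows directly if the quantizer spans $\pm\sigma$ with $2^b$ levels, i.e. step $\Delta_q = 2\sigma/2^b$ (the same step that gives the $\sigma/2^b$ bound above), using the standard $\Delta_q^2/12$ mean squared quantization error:

$$
\frac{\text{projection damage}}{\text{quantization damage}}
= \frac{\sigma^2}{\Delta_q^2 / 12}
= \frac{12\,\sigma^2}{(2\sigma / 2^b)^2}
= 3 \cdot 2^{2b}
\;=\; 768 \ \text{at } b = 4.
$$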
Pre-compressed weights you can download today, or runtime KV cache compression for tested architectures.
Peer review in progress. Full preprints available.
This work emerged from a systematic attempt to improve rank reduction — and the discovery that the paradigm itself was the barrier. After exhausting every closed-form metric, perturbation series, and learned correction, the breakthrough came from changing the compression operator entirely.
The result is currently in peer review.