fraQtl compresses model weights and KV cache for production inference. Same model behavior, lower VRAM, longer context, no retraining.
Free technical pilot for the first 5 qualified design partners.
Prefer email? contact@fraqtl.ai
At long context, KV cache size determines how many users you can serve concurrently and what models fit on your hardware. Most teams are forced to choose: shorter context, smaller model, or more GPUs. fraQtl removes that choice.
FP16 64K = 82.86 GB measured (OOM on 80 GB ceiling). 16K and 128K FP16 figures are conservative model + KV-cache extrapolations. fraQtl numbers measured on a single A100-80GB with the public artifact.
Send us your model and a workload sample. We calibrate, deliver a compressed artifact + benchmark report + integration path for your deployment stack. If the numbers don't move, no commitment.
enable_cache_compression(model)Free technical pilot for the first 5 qualified design partners.
Public artifacts on HuggingFace. Reproducible numbers. Lower-memory benchmarked artifacts that load through the standard Transformers loader.
At long context, KV cache becomes a major driver of GPU memory. fraQtl reduces memory pressure while preserving downstream retrieval and task quality — allowing larger contexts or more concurrent workloads on the same hardware. The V-only path achieves 3.5× V-cache compression at 98.5% NIAH retrieval; a V+K variant trades a little precision for ~3× total KV reduction.
Quantization in the right basis replaces rank reduction as the default for KV compression.
All numbers traceable to our public HuggingFace model card. pip install fraqtl-runtime to reproduce.
BF16 / FP16 is the quality reference. Public Q4 is the practical size baseline teams already deploy. fraQtl targets FP16-level behavior plus long-context KV memory savings the existing Q4 stack doesn't address.
Matched Q4_K_M bakeoff in progress — results published when locked.
| METRIC | FRAQTL | NEAREST PUBLISHED | DIFFERENCE |
|---|---|---|---|
| NIAH retention (1080 trials) | 98.5% | 97.8% · TOVA | +0.7 pp |
| PPL Δ at 3.5× compression | +0.012 | +0.214 · SnapKV | ~18× tighter |
| NIAH at 128K (Llama 3.1 8B) | 100% | 0% · KIVI-2 | +100 pp |
| GPUs needed at 128K (35B MoE) | 1× A100-80GB | 2× A100-80GB · FP16 | 50% hardware |
| METHOD | MEMORY | QUALITY | CALIBRATION | OVERHEAD | NIAH 1080 TRIALS |
PPL Δ V-ONLY |
|---|---|---|---|---|---|---|
| fraQtl (V-only) | 3.5× ✓ | ✓ within noise | ✓ 0.3s | 0% | 98.5% | +0.012 |
| fraQtl (V+K) | ~3× ✓ | ✓ within noise | ✓ 0.3s | 0% | 97.8% | — |
| FP16 baseline | 1× | — | — | — | 97.0% | 0 |
| TOVA | ✓ | ⚠ degrades | ✓ none | per-token | 97.8% | +0.259 |
| SnapKV | ✓ | ⚠ degrades | ✓ none | per-token | 94.1% | +0.214 |
| StreamingLLM | ✓ | ✗ poor | ✓ none | per-token | 37.8% | +0.548 |
| H2O† | ✓ | ✗ OOM | ✓ none | eager attn | 0.0% | — |
| KVQuant-2 | ✓ 2-bit | ⚠ moderate | ✗ 5–15 min | custom kernels | — | +0.27 |
| KIVI-2 | ✓ 2-bit | ✗ collapses at 128K | ✓ none | per-token | 37.8% → 0%@128K | +1.00 |
| BENCHMARK | FP16 | FRAQTL | Δ |
|---|---|---|---|
| MMLU 0-shot | 82.40% | 82.24% | −0.16 pp |
| ∞Bench Passkey @ 125,315 tokens | 30 / 30 | 30 / 30 | parity |
| HumanEval pass@1 | reference | 100% retention | within sampling noise |
| Wikitext-2 PPL Δ | — | +0.033 | tight |
Same model. Same hardware. 8× the context. The difference between needing 1 GPU and needing 2.
| CONTEXT | FP16 BASELINE | FRAQTL | HARDWARE |
|---|---|---|---|
| 16K | ~71 GB | 25.6 GB | Both fit on 1× A100-80GB |
| 64K | 82.9 GB → OOM | 36.8 GB | FP16 needs 2 GPUs; fraQtl 1 |
| 128K | 85+ GB | 51.7 GB | FP16 needs 2 GPUs; fraQtl 1 · 28 GB free |
Each cell is a KV-cache dimension. Watch what happens to attention routing under each compression strategy.
Rank throws away signal.
Quantization preserves it.
A 30-day technical pilot. We calibrate on your workload, benchmark against your FP16 baseline, and hand you a deployable artifact. Free for the first 5 qualified design partners.