Two surfaces: the public Qwen 3.6 35B-A3B compressed artifact (MMLU + ∞Bench + VRAM) and the matched-4-bit weight matrix across MHA, GQA-2, GQA-3, and GQA-4. Every number is reproducible against SHOMER-NUMBERS and golden_eval v1.
Live on Hugging Face. Loads through the standard Transformers workflow. 23.8 GB on disk. MoE: 35B total params, 3B active.
| Metric | fraQtl compressed | FP16 reference | Δ | Notes |
|---|---|---|---|---|
| MMLU (5-shot) | 82.24% | 82.40% | −0.16 pp | 14,042 questions, 57 subjects |
| ∞Bench passkey | 30 / 30 | 30 / 30 | tied | at 125,315 tokens |
| VRAM @ 16K context | 25.6 GB | ~71 GB | −45 GB | both fit on 1× A100-80GB |
| VRAM @ 64K context | 36.8 GB | 82.86 GB → OOM | FP16 needs 2× A100 | measured OOM on 80 GB ceiling |
| VRAM @ 128K context | 51.7 GB | 85+ GB | FP16 needs 2× A100 | 28 GB headroom on 1× A100 |
| Disk size | 23.8 GB | ~70 GB | ~3× smaller | safetensors, full artifact |
FP16 VRAM at 16K and 128K is extrapolated (model weights + KV cache) and is a conservative lower bound. The 64K FP16 figure is measured: 82.86 GB, which OOMs against the 80 GB ceiling.
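For readers who want to sanity-check those extrapolations, the lower bound is just FP16 weights plus FP16 KV cache. A minimal sketch follows; the layer count, KV-head count, and head dim are placeholders, not the actual Qwen 3.6 35B-A3B config, and activations/framework overhead are ignored:

```python
# Hedged sketch of the "model + KV-cache" FP16 extrapolation used as a
# conservative lower bound. Architecture numbers here are PLACEHOLDERS,
# not the real Qwen 3.6 35B-A3B config; activations and overhead are ignored.
def fp16_vram_lower_bound_gib(total_params_billion: float,
                              num_layers: int,
                              num_kv_heads: int,
                              head_dim: int,
                              context_len: int) -> float:
    weight_bytes = total_params_billion * 1e9 * 2               # 2 bytes/param
    # K and V caches: 2 tensors x layers x kv_heads x head_dim x tokens x 2 B
    kv_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * 2
    return (weight_bytes + kv_bytes) / 1024**3

# Placeholder example: a 35B-param model at 128K context.
print(f"{fp16_vram_lower_bound_gib(35, num_layers=48, num_kv_heads=8,"
      f" head_dim=128, context_len=131072):.1f} GiB")
```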
Weight-compressed artifact. Runtime KV-cache compression is a separate early-access layer and is not stacked on top of this artifact in these numbers.
```bash
pip install fraqtl-runtime
```
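A minimal loading sketch for the artifact. The repo id is a placeholder, and whether the loader requires `trust_remote_code` (versus `fraqtl-runtime` simply being importable) is an assumption; the point is only that the path is the standard Transformers one:

```python
# Minimal sketch of the standard Transformers loading path. The repo id is a
# PLACEHOLDER; trust_remote_code may or may not be required by the artifact.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "fraqtl/qwen3p6-35b-a3b-int"   # placeholder, not the real repo id
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    device_map="auto",        # one A100-80GB is enough per the table above
    torch_dtype="auto",
    trust_remote_code=True,
)

prompt = "Summarize grouped-query attention in two sentences."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```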
All Δ PPL and KL are against each row's own FP16 baseline (same eval set). Single-seed unless marked. 3-seed Mistral row is mean ± std.
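For orientation, this is roughly what a per-row Δ PPL / mean-KL measurement against the FP16 baseline looks like on one 256+256 prefix+continuation window (see the protocol line at the bottom of the page). Function names, KL direction, and the averaging convention are assumptions, not the golden_eval v1 API:

```python
# Illustrative delta-PPL / mean-KL measurement on one 256+256 window.
# NOT the golden_eval v1 implementation; KL direction and averaging are assumed.
import torch
import torch.nn.functional as F

@torch.no_grad()
def continuation_nll_and_logits(model, input_ids, prefix_len=256):
    logits = model(input_ids).logits                       # [1, T, vocab]
    pred = logits[:, prefix_len - 1:-1, :]                 # predicts tokens prefix_len..T-1
    tgt = input_ids[:, prefix_len:]
    nll = F.cross_entropy(pred.reshape(-1, pred.size(-1)), tgt.reshape(-1))
    return nll, pred

@torch.no_grad()
def delta_ppl_and_mean_kl(fp16_model, quant_model, input_ids, prefix_len=256):
    nll_ref, logits_ref = continuation_nll_and_logits(fp16_model, input_ids, prefix_len)
    nll_q, logits_q = continuation_nll_and_logits(quant_model, input_ids, prefix_len)
    delta_ppl = (nll_q.exp() - nll_ref.exp()).item()
    # Mean per-token KL(FP16 || quantized) over the continuation positions.
    lq = F.log_softmax(logits_q.float(), dim=-1).reshape(-1, logits_q.size(-1))
    lr = F.log_softmax(logits_ref.float(), dim=-1).reshape(-1, logits_ref.size(-1))
    mean_kl = F.kl_div(lq, lr, log_target=True, reduction="batchmean").item()
    return delta_ppl, mean_kl
```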
| Model / Architecture | FP16 PPL | bnb NF4 Δ / KL / b/w | AWQ 4-bit Δ / KL / b/w | GPTQ 4-bit Δ / KL / b/w | fraQtl INT3+sign Δ / KL / b/w |
|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0.2 (GQA-4) | 6.6068 | +0.1430 / 0.0274 / 4.50 (3-seed) | +0.1590 ±0.0001 / 0.0293 / 4.12 (3-seed) | +0.1721 / 0.0436 / 4.12 | +0.0504 ±0.0108 / 0.0165 / 3.62 (3-seed) |
| Llama-3.2-3B-Instruct (GQA-3) | 12.3720 | +0.7445 / 0.0644 / 4.50 | +0.8015 / 0.0605 / 4.12 | BROKEN 1 | +0.4279 / 0.0254 / 3.86 |
| Qwen2.5-3B-Instruct (GQA-2) | 8.3597 | +0.5091 / 0.0843 / 4.50 | +0.5865 / 0.0775 / 4.12 2 | +0.5945 / 0.0991 / 4.12 | +0.2241 / 0.0362 / 4.18 |
| Phi-3-mini-4k-instruct (TRUE MHA) | 6.5048 | +0.5873 / 0.0965 / 4.50 | FAILED 3 | n/a 4 | +0.2061 / 0.0466 / 3.86 |
1 gptqmodel 1.9.0 predates Llama 3 (released 2024-04): library-version failure, not a GPTQ-method failure. Newer gptqmodel 2.x has its own PyPI-drift instability (blocked original C61). Excluded from ratio/scoreboard math.
2 AWQ on Qwen2 needed an nn.Module.__getattr__ monkey-patch inside awq.quantize() to forward the Catcher wrapper's missing attention_type attribute (an illustrative sketch follows these footnotes). Script: notebooks/benchmarks/C60_qwen_awq_patched.py.
3 AWQ on Phi-3 hits a DIFFERENT AutoAWQ bug (KeyError: 'type') than Qwen2's. Not chased per sunk-cost rule; deferred to llm-compressor sprint. Script: C63_phi3_mha_golden.py.
4 GPTQ not attempted on Phi-3 this session (focus: MHA universality via bnb/fraQtl).
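The sketch referenced in footnote 2: AutoAWQ's Catcher wrapper intercepts the first decoder layer during calibration, and Qwen2's attention path looks up an attribute that only exists on the wrapped layer, so forwarding unresolved attribute lookups is one way around it. The wrapped-module attribute name (`module`) and the lookup path are assumptions; the actual patch is the one committed in notebooks/benchmarks/C60_qwen_awq_patched.py.

```python
# Illustrative reconstruction of the footnote-2 workaround, NOT the committed
# patch. Assumption: the Catcher wrapper stores the real decoder layer under
# the submodule name "module".
import torch.nn as nn

_orig_getattr = nn.Module.__getattr__

def _forwarding_getattr(self, name):
    try:
        return _orig_getattr(self, name)
    except AttributeError:
        # Forward attributes the wrapper lacks (e.g. `attention_type`) to the
        # wrapped decoder layer, if present.
        wrapped = self.__dict__.get("_modules", {}).get("module")
        if wrapped is not None and hasattr(wrapped, name):
            return getattr(wrapped, name)
        raise

nn.Module.__getattr__ = _forwarding_getattr   # apply before awq.quantize()
```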
Ratios are stable across architectures. The MHA result (Phi-3, 2.85× vs bnb) lands in the same band as the GQA-4 result (Mistral, 2.84× vs bnb).
| Model | Δ vs bnb NF4 | Δ vs AWQ 4-bit | Δ vs GPTQ 4-bit | KL vs bnb | KL vs AWQ | KL vs GPTQ |
|---|---|---|---|---|---|---|
| Mistral 7B Instruct | 2.84× tighter | 3.16× tighter | 3.41× tighter | 1.66× | 1.78× | 2.64× |
| Llama 3.2 3B Instruct | 1.74× tighter | 1.87× tighter | n/a (GPTQ broken) | 2.54× | 2.39× | n/a |
| Qwen 2.5 3B Instruct | 2.27× tighter | 2.62× tighter | 2.65× tighter | 2.33× | 2.14× | 2.74× |
| Phi-3-mini-4k (MHA) | 2.85× tighter | n/a (AWQ failed) | n/a (not attempted) | 2.07× | n/a | n/a |
BROKEN peer results (Llama GPTQ) and library failures (AWQ on Phi-3) are NOT counted as fraQtl wins — they're counted as "peer didn't run."
| Peer | Attempted | Usable | fraQtl wins (matched) |
|---|---|---|---|
| bnb NF4 | 4/4 | 4/4 | 4/4 |
| AWQ 4-bit | 4/4 | 3/4 (Qwen needed patch, Phi-3 failed) | 3/3 |
| GPTQ 4-bit | 3/4 (Phi-3 not attempted) | 2/3 (Llama broken) | 2/2 |
fraQtl has TWO configs in the KV lane; don't conflate them. Read this before the KV tables below.
The matrix above compresses weights. fraQtl's V-theorem + sign-correction mechanism also applies to KV cache (runtime-dynamic tensor). Cross-architecture KV results follow — per-needle NIAH and 3-seed PPL/KL where measured.
Config in this table = fraQtl V-only 4b (partial stack: eigenbasis + INT4 uniform, no sign correction). Full-stack INT3+sign numbers are in the PPL/KL table below.
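To make the partial-stack config concrete, here is a minimal "eigenbasis + INT4 uniform" round-trip on a V-cache tensor, with the basis fit by SVD over calibration V vectors. This is an illustration only: the calibration source, grouping, and packing are assumptions, and the V-theorem / sign-correction machinery of the full stack is deliberately absent.

```python
# Illustrative "eigenbasis + INT4 uniform" V-cache round-trip (partial stack,
# no sign correction). NOT the fraQtl implementation; shapes and conventions
# are assumed for the sketch.
import torch

def fit_eigenbasis(v_calib: torch.Tensor) -> torch.Tensor:
    """v_calib: [num_calib_tokens, head_dim] V vectors for one KV head."""
    _, _, vh = torch.linalg.svd(v_calib, full_matrices=False)
    return vh.T                                   # orthonormal basis, [d, d]

def quantize_int4_uniform(x: torch.Tensor):
    """Per-token symmetric INT4 quantization: integer levels in [-8, 7]."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / 7.0
    q = torch.clamp(torch.round(x / scale), -8, 7)
    return q, scale

def compress_v(v: torch.Tensor, basis: torch.Tensor):
    """v: [num_tokens, head_dim] runtime V cache for one KV head."""
    q, scale = quantize_int4_uniform(v @ basis)   # rotate, then quantize
    return q, scale

def decompress_v(q: torch.Tensor, scale: torch.Tensor, basis: torch.Tensor):
    return (q * scale) @ basis.T                  # dequantize, rotate back

# Fit once on calibration tokens, then apply to the runtime-dynamic V cache.
basis = fit_eigenbasis(torch.randn(4096, 128))
q, scale = compress_v(torch.randn(1024, 128), basis)
v_hat = decompress_v(q, scale, basis)
```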
| Model / Arch | Protocol | fraQtl V-only 4b | fraQtl V+K 4b | KIVI-4 | KVQuant-4 | KIVI-2 | PyramidKV 0.7 |
|---|---|---|---|---|---|---|---|
| Mistral 7B Instruct (GQA-4) | C44b · 1080-cell NIAH 4K–31K | 94.4% | 93.3% | 93.3% | 93.3% | 37.8% | 86.1% |
| Qwen 2.5 3B Instruct (GQA-2) | C44b · 1080-cell NIAH 4K–31K | 99.4% | 79.4% 5 | 98.9% | 97.8% / 100% (sink0) | 1.7% | 69.4% |
| Llama 3.1 8B Instruct (GQA-8) | C44d · 128K multi-needle | 93.3% | n/a | 93.3% | n/a | 0.0% | n/a |
| Llama 3.1 8B Instruct (GQA-8) | C44e · 128K depth-sweep 6 | 100% | n/a | 100% | n/a | 0.0% | n/a |
5 fraQtl V+K on Qwen 3B GQA-2 used Mistral's blind k_protect — per-model calibration pending (see EVAL-PROTOCOL-LOCKED). Not a fraQtl limitation claim.
6 notebooks/benchmarks/C44e_llama3_8b_128k_shallow_depth.py (separate from C44e_pyramidkv_bakeoff.py).
The 3-seed rows below (those reporting ± std) are the paper-grade / BD-headline numbers. Full stack = INT3 + sign correction.
| Model / Arch | Config | Δ PPL (3-seed) | KL (3-seed) | NIAH (3-seed) | Source |
|---|---|---|---|---|---|
| Partial stack · C25 family | | | | | |
| Mistral 7B Instruct (GQA-4) | V-only k=16 LM-INT3 | +0.027 ±0.005 | 0.00381 ±0.00001 | — | SHOMER C25 |
| Mistral 7B Instruct (GQA-4) | V+K k=16 LM-INT3 | +0.062 ±0.001 | 0.00751 ±0.00013 | — | SHOMER C25 |
| Full stack · INT3 + sign · C38a (paper-grade) | | | | | |
| Mistral 7B Instruct (GQA-4) | V-only k=16 INT3+sign | +0.0015 ±0.0044 | 0.00317 | — | SHOMER C38a v2 |
| Mistral 7B Instruct (GQA-4) | K k=16 INT3+sign | +0.0070 ±0.0043 | — | — | SHOMER C38a |
| Llama 3.2 3B Instruct (GQA-3) | V k=8 INT3+sign | +0.0181 ±0.015 | 0.00308 | 99.4% (179/180) | SHOMER C38a |
| Llama 3.2 3B Instruct (GQA-3) | K k=16 INT3+sign | +0.0221 ±0.0079 | 0.00312 | 100% (180/180) | SHOMER C38a |
| Qwen 2.5 3B Instruct (GQA-2) | V k=8 INT3+sign | +0.0542 ±0.0116 | 0.00362 | 51.7% 7 | SHOMER C38a |
| Phi-3-mini-128k-instruct (TRUE MHA) | V-only k=16 INT3+sign 9 | +0.0002 (1-seed) | 0.00122 | 93.3% (56/60) | C62 2026-04-21 |
| Phi-3-mini-128k-instruct (TRUE MHA) | V+K k=16 INT3+sign 9 | +0.0073 (1-seed) | 0.00383 | 95.0% (57/60 — ties FP16) | C62 2026-04-21 |
| Partial stack · MoE hybrid attention · C25q | | | | | |
| Qwen 3.6 35B-A3B (MoE hybrid) | V-only k=16 INT3 | +0.045 ±0.011 | 0.0183 ±0.0023 | — | SHOMER C25q |
| Qwen 3.6 35B-A3B (MoE hybrid) | V+K k=16 INT3 | +0.166 ±0.049 | 0.0221 ±0.0015 | — | SHOMER C25q |
7 Qwen 2.5 3B V NIAH: 3-seed mean 51.7%, FP16 baseline 58.3% — small-model short-context NIAH has low ceiling. KLD is 1.5× tighter than INT4 uniform; honest mixed result on GQA-2 V cache, not a clean NIAH win. The sign-correction paradigm's KLD advantage is the cross-arch invariant; PPL/NIAH narrow on GQA-2.
9 Phi-3 MHA rows are 1-seed from C62 patched run (2026-04-21, source commit 3a0bff7). Per EVAL-PROTOCOL-LOCKED.md ratio rule, 1-seed is acceptable for RATIO comparisons where ratio > 1.5× — these rows satisfy that vs KIVI-4 (140× tighter Δ PPL) and KVQuant-4 (15.5× tighter Δ PPL). For ABSOLUTE-delta public citation, a 3-seed re-run is queued (~30 min × 2 additional seeds on A100-40GB). Raw JSON verified against live run output.
Combined signal from C44b + C44c sanity + C44d + C44e shallow-depth. Every context length. Every needle type. Every depth position.
| Context | Architecture | Grid | KIVI-2 retention |
|---|---|---|---|
| 4K–31K | Mistral 7B Instruct (GQA-4) | 1080 cells · 3 needles × 4 ctx × 5 depths × 3 trials | 37.8% |
| 4K–31K | Qwen 2.5 3B Instruct (GQA-2) | 1080 cells | 1.7% |
| 128K | Llama 3.1 8B Instruct (GQA-8) | 15 cells · 3 needles × 5 trials @ depth 50 (C44d) | 0.0% |
| 128K | Llama 3.1 8B Instruct (GQA-8) | 9 cells · technical_password × 3 depths × 3 trials (C44e) | 0.0% |
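For reference, retention in these grids is just retrieved cells over total cells, with the grid enumerated as needles × context lengths × depths × trials. A minimal harness shape follows; the generate callable, needle format, and exact-match scoring are placeholders, not the C44 harness.

```python
# Illustrative NIAH grid runner: retention = retrieved cells / total cells.
# The generate() callable, needle format, and scoring rule are placeholders.
from itertools import product

def run_niah_grid(generate, needles, context_lens, depths, trials):
    """generate(needle, context_len, depth, trial) -> model answer string."""
    hits, cells = 0, 0
    for needle, ctx, depth, trial in product(needles, context_lens,
                                             depths, range(trials)):
        answer = generate(needle, ctx, depth, trial)
        hits += int(needle["secret"] in answer)   # exact-retrieval scoring
        cells += 1
    return hits / cells, cells
```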
fraQtl V-only 4b (partial stack) vs peer KV families. Full-stack INT3+sign numbers in §03 extend the margin further; C44-harness full-stack re-fires queued for Wednesday.
| Peer (KV) | Architectures tested | fraQtl partial-stack V-only 4b result |
|---|---|---|
| KIVI-2 (per-token K 2-bit) | Mistral GQA-4 / Qwen GQA-2 / Llama 3.1 GQA-8 @128K × 2 protocols / Phi-3 MHA | 5/5 catastrophic margins (fraQtl V-only ≥73% vs KIVI-2 ≤38%) |
| KIVI-4 (per-token K 4-bit) | Mistral / Qwen / Llama 3.1 @128K × 2 / Phi-3 MHA | 4/5 non-losses — Mistral + Qwen + 2× Llama 128K non-losses; Phi-3 MHA partial-stack 73.3% vs KIVI-4 91.7% — LOSES (full-stack re-fire Wednesday; full-stack V-only already 140× tighter Δ PPL vs KIVI-4 — see §03) |
| KVQuant-4 | Mistral / Qwen / Phi-3 MHA | 2/3 ties or wins on GQA (Mistral 94.4>93.3 / Qwen 99.4>97.8 / Qwen sink0 100). Phi-3 MHA partial-stack 73.3% vs KVQuant-4 93.3% — LOSES (277× Δ PPL gap); full-stack V-only flips it at 15.5× tighter Δ PPL, V+K ties NIAH at 95.0% — see §03 |
| PyramidKV 0.7 | Mistral / Qwen | 2/2 wins (+8.3 pp / +30.0 pp aggregate over PyramidKV) |
| SnapKV / H2O / StreamingLLM / TOVA / ExpectedAttention (C44 original, Mistral only) | Mistral 7B Instruct | fraQtl 100% · TOVA 97.8% (near-tie) · SnapKV 94.1% · ExpectedAttention 53.3% · StreamingLLM 37.8% · H2O 22.0% — 3 of 5 competitors fail catastrophically (<54%) |
Source for eviction-peer row: docs/MLP-QUANTIZATION-C14-C17-RESULTS.md L1916–1982. Original "fraQtl is the ONLY method" framing retracted per source L1978.
Distinct from the weight-compression-ratio matrix above. Runtime GPU memory story on Mistral 7B at 32K context.
| Artifact | State | Measurement |
|---|---|---|
| FP16 baseline @ 32K | measured | 13.91 GB weights · 23.12 GB peak inference · 56.88 GB headroom (A100-80GB) |
| fraQtl-packed (current loader) @ 32K | measured | 9.84 GB on disk (30% smaller) · 14.03 GB weights in memory (loader dequantizes to FP16 by design) · coherent generation · 20.4 tok/s |
| PackedLinear scaffold (runtime-packed) | projected · sanity-passed | 3.2× per-MLP-layer with shared eigenbasis → ~3.5 GB weights for full Mistral 7B (vs 14 GB FP16). Numerical sanity: 0.04% mean-relative error on fp16 matmul vs nn.Linear reconstruction |
| Full end-to-end measured demo | coming · Wednesday 2026-04-22 | Mistral 7B at ~3.5 GB weights + 32K context · actual nvidia-smi trace |
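The PackedLinear numerical-sanity figure (0.04% mean-relative error) is the kind of number the check below produces: run the reference nn.Linear, run the matmul against a reconstructed weight, and report mean relative error. The stand-in round-trip quantizer here is not the PackedLinear shared-eigenbasis packing; only the measurement itself is shown, and the error you get depends entirely on the reconstruction plugged in.

```python
# Illustrative matmul sanity check: mean relative error of a reconstructed-
# weight matmul vs the reference nn.Linear. The round-trip quantizer below is
# a STAND-IN, not the PackedLinear shared-eigenbasis packing.
import torch
import torch.nn as nn

def mean_relative_error(ref: torch.Tensor, test: torch.Tensor) -> float:
    diff = (ref - test).float().abs()
    return (diff / ref.float().abs().clamp_min(1e-6)).mean().item()

@torch.no_grad()
def matmul_sanity(linear: nn.Linear, x: torch.Tensor) -> float:
    ref = linear(x)
    w = linear.weight
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0    # stand-in INT4 round-trip
    w_hat = torch.clamp(torch.round(w / scale), -8, 7) * scale
    bias = linear.bias if linear.bias is not None else 0
    return mean_relative_error(ref, x @ w_hat.T + bias)

layer = nn.Linear(4096, 11008, dtype=torch.float16)    # MLP-sized layer
x = torch.randn(32, 4096, dtype=torch.float16)
print(f"mean relative error: {matmul_sanity(layer, x):.4%}")
```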
Detailed bakeoff write-ups: C50D-AWQ-MULTI-BIT.md · C44B-KIVI-KVQUANT-BAKEOFF.md · C44E-PYRAMIDKV-BAKEOFF.md · C51-THROUGHPUT-BAKEOFF.md.

Every number in this matrix traces to a committed script + raw JSON on the fraqtl-hf-cache Modal volume.
| Row | Script | Raw JSON (fraqtl-hf-cache:fraqtl-results/) | Commit hash(es) |
|---|---|---|---|
| Mistral 3-seed (fraQtl, bnb, AWQ) | C60_golden_mistral_instruct.py | c60_golden_mistralai_Mistral-7B-Instruct-v02_seed{42,123,2024}.json | 1209261, 9038743, d3af566 |
| Mistral GPTQ 1-seed | C64_gptq_pinned.py via modal_run_gptq.py | c64_gptq_pinned_mistralai_Mistral-7B-Instruct-v02_seed42.json | 17b563f |
| Llama 3B golden (fraQtl, bnb, AWQ) 1-seed | C60_golden_mistral_instruct.py (env MODEL=…) | c60_golden_meta-llama_Llama-32-3B-Instruct_seed42.json | 1209261 |
| Llama 3B GPTQ BROKEN 1-seed | C64_gptq_pinned.py (env MODEL=…) | c64_gptq_pinned_meta-llama_Llama-32-3B-Instruct_seed42.json | cc24b1d |
| Qwen 3B golden (fraQtl, bnb) 1-seed | C60_golden_mistral_instruct.py (env MODEL=…) | c60_golden_Qwen_Qwen25-3B-Instruct_seed42.json | 1209261 |
| Qwen 3B AWQ (Catcher-patched) 1-seed | C60_qwen_awq_patched.py | c60_qwen_awq_patched_seed42.json | 19bea7f |
| Qwen 3B GPTQ 1-seed | C64_gptq_pinned.py (env MODEL=…) | c64_gptq_pinned_Qwen_Qwen25-3B-Instruct_seed42.json | cc24b1d |
| Phi-3 MHA golden (fraQtl, bnb) 1-seed | C63_phi3_mha_golden.py | c63_phi3_mha_golden_seed42.json | cc24b1d |
| C44b KIVI/KVQuant 1080-cell (Mistral + Qwen) | C44b_kivi_kvquant_bakeoff.py | c44b_{mistral,qwen3b}_full_seed42.json | 1209261 |
| C44e PyramidKV 1080-cell (Mistral + Qwen) | C44e_pyramidkv_bakeoff.py | c44e_pyramidkv_{mistral,qwen3b}_seed42.json | 3b5e8a8, c1388b8 |
| C44d Llama 3.1 8B 128K multi-needle NIAH | C44d_llama3_8b_128k_3needle.py | c44d_llama3_8b_128k_3needle_seed42.json (volume) | see docs/C44D-MULTI-NEEDLE-128K.md |
| C44e-shallow Llama 3.1 8B 128K depth sweep | C44e_llama3_8b_128k_shallow_depth.py | c44e_llama3_8b_128k_shallow_depth_seed42.json (volume) | see docs/C44E-SHALLOW-DEPTH-128K.md |
| C62 Phi-3 MHA full-stack KV (V-only + V+K INT3+sign, 1-seed) | notebooks/benchmarks/C62_phi3_fullstack.py | c62_phi3_mha_seed42.json (8 configs, partial + full-stack side-by-side) | 3a0bff7 · 527afad (matrix mirror) |
| Memory @ 32K (FP16 + fraQtl-packed) | fraqtl/docs/MEASURED-MEMORY-32K.md | nvidia-smi trace @ A100-80GB | f5db558 |
| PackedLinear scaffold + sanity test | fraqtl/src/fraqtl/packed_linear.py · experiments/packed_linear_sanity.py | 0.04% mean-rel error vs nn.Linear | 19e27c3 |
Try the public Qwen 3.6 35B-A3B compressed artifact, or pilot fraQtl on your own model stack.
Hugging Face · Request Pilot · KV Cache Explainer

Source of record: commit 527afad of docs/FRAQTL-ARCHITECTURE-AGNOSTIC-MATRIX.md (Phi-3 MHA full-stack mirror; source data commit 3a0bff7) and docs/SHOMER-NUMBERS.md. Eval protocol: docs/EVAL-PROTOCOL-LOCKED.md · golden_eval v1 · WikiText-2 test 64×512 · 256+256 prefix+continuation.