BENCHMARKS · PUBLIC ARTIFACTS · WEIGHT MATRIX

Receipts.

Two surfaces: the public Qwen 3.6 35B-A3B compressed artifact (MMLU + ∞Bench + VRAM) and the matched-4-bit weight matrix across MHA, GQA-2, GQA-3, and GQA-4. Every number reproducible against SHOMER-NUMBERS and golden_eval v1.

Public artifact: Qwen 3.6 35B-A3B compressed runs 128K context on 1× A100-80GB at 51.7 GB VRAM, MMLU 82.24% (FP16: 82.40%), ∞Bench passkey 30/30 at 125,315 tokens.
Weight matrix: matched-4-bit results across 4 architecture families using the same core recipe (KGATE_UP = KDOWN = 256, INT3 + sign correction), with validation reported per model. GOLDEN_EVAL_V1 · WikiText-2 64×512 · 256+256 prefix+continuation · PPL + KL(FP16 ‖ compressed)

00Public artifact · Qwen 3.6 35B-A3B

Live on Hugging Face. Loads through standard Transformers workflow. 23.8 GB on disk. MoE: 35B params, 3B active.

Metric fraQtl compressed FP16 reference Δ Notes
MMLU (5-shot) 82.24% 82.40% −0.16 pp 14,042 questions, 57 subjects
∞Bench passkey 30 / 30 30 / 30 tied at 125,315 tokens
VRAM @ 16K context 25.6 GB ~71 GB −45 GB both fit on 1× A100-80GB
VRAM @ 64K context 36.8 GB 82.86 GB → OOM FP16 needs 2× A100 measured OOM on 80 GB ceiling
VRAM @ 128K context 51.7 GB 85+ GB FP16 needs 2× A100 28 GB headroom on 1× A100
Disk size 23.8 GB ~70 GB ~3× smaller safetensors, full artifact

VRAM at 16K and 128K for FP16 are model + KV-cache extrapolations — conservative lower bounds. 64K FP16 = 82.86 GB measured (OOMs on 80 GB ceiling).

Weight-compressed artifact. Runtime KV-cache compression is a separate early-access layer and is not stacked on top of this artifact in these numbers.

Reproducible: huggingface.co/fraQtl/Qwen3.6-35B-A3B-compressed · install via pip install fraqtl-runtime

01The weight matrix

All Δ PPL and KL are against each row's own FP16 baseline (same eval set). Single-seed unless marked. 3-seed Mistral row is mean ± std.

Model / Architecture FP16
PPL
bnb NF4
Δ / KL / b/w
AWQ 4-bit
Δ / KL / b/w
GPTQ 4-bit
Δ / KL / b/w
fraQtl INT3+sign
Δ / KL / b/w
Mistral-7B-Instruct-v0.2 (GQA-4) 6.6068 +0.1430 / 0.0274 / 4.50 (3-seed) +0.1590 ±0.0001 / 0.0293 / 4.12 (3-seed) +0.1721 / 0.0436 / 4.12 +0.0504 ±0.0108 / 0.0165 / 3.62 (3-seed)
Llama-3.2-3B-Instruct (GQA-3) 12.3720 +0.7445 / 0.0644 / 4.50 +0.8015 / 0.0605 / 4.12 BROKEN 1 +0.4279 / 0.0254 / 3.86
Qwen2.5-3B-Instruct (GQA-2) 8.3597 +0.5091 / 0.0843 / 4.50 +0.5865 / 0.0775 / 4.12 2 +0.5945 / 0.0991 / 4.12 +0.2241 / 0.0362 / 4.18
Phi-3-mini-4k-instruct (TRUE MHA) 6.5048 +0.5873 / 0.0965 / 4.50 FAILED 3 n/a 4 +0.2061 / 0.0466 / 3.86

1 gptqmodel 1.9.0 predates Llama 3 (released 2024-04): library-version failure, not a GPTQ-method failure. Newer gptqmodel 2.x has its own PyPI-drift instability (blocked original C61). Excluded from ratio/scoreboard math.

2 AWQ on Qwen2 needed a nn.Module.__getattr__ monkey-patch inside awq.quantize() to forward Catcher's missing attention_type attribute. Script: notebooks/benchmarks/C60_qwen_awq_patched.py.

3 AWQ on Phi-3 hits a DIFFERENT AutoAWQ bug (KeyError: 'type') than Qwen2's. Not chased per sunk-cost rule; deferred to llm-compressor sprint. Script: C63_phi3_mha_golden.py.

4 GPTQ not attempted on Phi-3 this session (focus: MHA universality via bnb/fraQtl).

fraQtl ratios vs each peer

Ratios are stable across architectures. The MHA result (Phi-3, 2.85× vs bnb) lands in the same band as the GQA-4 result (Mistral, 2.84× vs bnb).

Model Δ vs bnb NF4 Δ vs AWQ 4-bit Δ vs GPTQ 4-bit KL vs bnb KL vs AWQ KL vs GPTQ
Mistral 7B Instruct2.84× tighter3.16× tighter3.41× tighter1.66×1.78×2.64×
Llama 3.2 3B Instruct1.74× tighter1.87× tightern/a (GPTQ broken)2.54×2.39×n/a
Qwen 2.5 3B Instruct2.27× tighter2.62× tighter2.65× tighter2.33×2.14×2.74×
Phi-3-mini-4k (MHA)2.85× tightern/a (AWQ failed)n/a (not attempted)2.07×n/an/a

Scoreboard — honest counts

BROKEN peer results (Llama GPTQ) and library failures (AWQ on Phi-3) are NOT counted as fraQtl wins — they're counted as "peer didn't run."

PeerAttemptedUsablefraQtl wins (matched)
bnb NF44/44/44/4
AWQ 4-bit4/43/4 (Qwen needed patch, Phi-3 failed)3/3
GPTQ 4-bit3/4 (Phi-3 not attempted)2/3 (Llama broken)2/2
Total: 9/9 matched-bits wins where the peer produced a usable number, across 4 architectures. Same core recipe with per-model validation; no architecture-specific algorithm changes.

02fraQtl config key

fraQtl has TWO configs in the KV lane. Don't conflate them. Read before the KV tables below.

PARTIAL STACK · EXPERIMENT-GRADE
fraQtl V-only 4b
Eigenbasis only + INT4 uniform quantization. No sign correction.
C44b (1080-cell NIAH) · C44c sanity · C44d (128K multi-needle) · C44e (128K depth-sweep)
FULL STACK · PAPER-GRADE · "NANO-BITS"
fraQtl V-only INT3 + sign
Eigenbasis + k=16 FP16 protect + LM-INT3 on sacrifice dims + sign correction. 3-seed-validated, pinned-eval.
C38a cross-arch 3-seed · C25q Qwen 3.6 35B-A3B MoE
BD rule: cold-email headline numbers use full stack (INT3+sign) from C38a. Long-context KIVI-2 collapse stories use partial stack (V-only 4b) from C44 family — still valid, just honestly labeled.

03KV cache substrate

The matrix above compresses weights. fraQtl's V-theorem + sign-correction mechanism also applies to KV cache (runtime-dynamic tensor). Cross-architecture KV results follow — per-needle NIAH and 3-seed PPL/KL where measured.

Cross-arch NIAH at matched 4-bit vs KIVI / KVQuant / eviction peers

Config = fraQtl V-only 4b (partial stack, eigenbasis + INT4 uniform, no sign correction). Full-stack INT3+sign numbers in the PPL/KL table below.

Model / Arch Protocol fraQtl V-only 4b fraQtl V+K 4b KIVI-4 KVQuant-4 KIVI-2 PyramidKV 0.7
Mistral 7B Instruct (GQA-4) C44b · 1080-cell NIAH 4K–31K 94.4% 93.3% 93.3% 93.3% 37.8% 86.1%
Qwen 2.5 3B Instruct (GQA-2) C44b · 1080-cell NIAH 4K–31K 99.4% 79.4% 5 98.9% 97.8% / 100% (sink0) 1.7% 69.4%
Llama 3.1 8B Instruct (GQA-8) C44d · 128K multi-needle 93.3% n/a 93.3% n/a 0.0% n/a
Llama 3.1 8B Instruct (GQA-8) C44e · 128K depth-sweep 6 100% n/a 100% n/a 0.0% n/a

5 fraQtl V+K on Qwen 3B GQA-2 used Mistral's blind k_protect — per-model calibration pending (see EVAL-PROTOCOL-LOCKED). Not a fraQtl limitation claim.

6 notebooks/benchmarks/C44e_llama3_8b_128k_shallow_depth.py (separate from C44e_pyramidkv_bakeoff.py).

Full-stack C44-harness re-fires queued for Wednesday — benchmark-agent ticket: run C44b/C44d/C44e at fraQtl V-only INT3+sign config to widen the matched-bits margin. Partial-stack already ties KIVI-4; full-stack expected to extend to sub-4bit Pareto.

Cross-arch KV PPL/KL 3-seed pinned

The 3-seed rows bolded below are the paper-grade / BD-headline numbers. Full-stack = INT3 + sign correction.

Model / Arch Config Δ PPL (3-seed) KL (3-seed) NIAH (3-seed) Source
Partial stack · C25 family
Mistral 7B Instruct (GQA-4)V-only k=16 LM-INT3+0.027 ±0.0050.00381 ±0.00001SHOMER C25
Mistral 7B Instruct (GQA-4)V+K k=16 LM-INT3+0.062 ±0.0010.00751 ±0.00013SHOMER C25
Full stack · INT3 + sign · C38a (paper-grade)
Mistral 7B Instruct (GQA-4)V-only k=16 INT3+sign+0.0015 ±0.00440.00317SHOMER C38a v2
Mistral 7B Instruct (GQA-4)K k=16 INT3+sign+0.0070 ±0.0043SHOMER C38a
Llama 3.2 3B Instruct (GQA-3)V k=8 INT3+sign+0.0181 ±0.0150.0030899.4% (179/180)SHOMER C38a
Llama 3.2 3B Instruct (GQA-3)K k=16 INT3+sign+0.0221 ±0.00790.00312100% (180/180)SHOMER C38a
Qwen 2.5 3B Instruct (GQA-2)V k=8 INT3+sign+0.0542 ±0.01160.0036251.7% 7SHOMER C38a
Phi-3-mini-128k-instruct (TRUE MHA)V-only k=16 INT3+sign 9+0.0002 (1-seed)0.0012293.3% (56/60)C62 2026-04-21
Phi-3-mini-128k-instruct (TRUE MHA)V+K k=16 INT3+sign 9+0.0073 (1-seed)0.0038395.0% (57/60 — ties FP16)C62 2026-04-21
Partial stack · MoE hybrid attention · C25q
Qwen 3.6 35B-A3B (MoE hybrid)V-only k=16 INT3+0.045 ±0.0110.0183 ±0.0023SHOMER C25q
Qwen 3.6 35B-A3B (MoE hybrid)V+K k=16 INT3+0.166 ±0.0490.0221 ±0.0015SHOMER C25q

7 Qwen 2.5 3B V NIAH: 3-seed mean 51.7%, FP16 baseline 58.3% — small-model short-context NIAH has low ceiling. KLD is 1.5× tighter than INT4 uniform; honest mixed result on GQA-2 V cache, not a clean NIAH win. The sign-correction paradigm's KLD advantage is the cross-arch invariant; PPL/NIAH narrow on GQA-2.

9 Phi-3 MHA rows are 1-seed from C62 patched run (2026-04-21, source commit 3a0bff7). Per EVAL-PROTOCOL-LOCKED.md ratio rule, 1-seed is acceptable for RATIO comparisons where ratio > 1.5× — these rows satisfy that vs KIVI-4 (140× tighter Δ PPL) and KVQuant-4 (15.5× tighter Δ PPL). For ABSOLUTE-delta public citation, a 3-seed re-run is queued (~30 min × 2 additional seeds on A100-40GB). Raw JSON verified against live run output.

Phi-3 MHA full-stack lands the cross-arch story. Partial-stack on MHA KV previously lost to KIVI-4 (73.3% vs 91.7%); full-stack V-only INT3+sign flips it — +0.0002 Δ PPL (140× tighter than KIVI-4), 93.3% NIAH. V+K ties FP16 at 95.0%. Full-stack C44-harness re-fires for partial-stack margin extension staged for Wednesday.
Gap (queued, P0): 3-seed re-run of Phi-3 MHA V + K (1-seed → 3-seed) for absolute-delta public citation. Ratio-rule-compliant today; 3-seed closes the absolute-delta path. Also: K-cache 3-seed INT3+sign on Qwen 2.5 3B (GQA-2).

04KIVI-2 catastrophic-collapse signature

Combined signal from C44b + C44c sanity + C44d + C44e shallow-depth. Every context length. Every needle type. Every depth position.

Context Architecture Grid KIVI-2 retention
4K–31KMistral 7B Instruct (GQA-4)1080 cells · 3 needles × 4 ctx × 5 depths × 3 trials37.8%
4K–31KQwen 2.5 3B Instruct (GQA-2)1080 cells1.7%
128KLlama 3.1 8B Instruct (GQA-8)15 cells · 3 needles × 5 trials @ depth 50 (C44d)0.0%
128KLlama 3.1 8B Instruct (GQA-8)9 cells · technical_password × 3 depths × 3 trials (C44e)0.0%
Pattern: KIVI-2's per-token K quantization collapses at EVERY context length tested, across EVERY needle type, across EVERY depth position. fraQtl V-only 4-bit ties or beats KIVI-4 everywhere. fraQtl 2-bit regimes via V-only k=16 INT3 sit at 3.5× (different Pareto point from KIVI-2's 8×).

05KV substrate — scoreboard

fraQtl V-only 4b (partial stack) vs peer KV families. Full-stack INT3+sign numbers in §03 extend the margin further; C44-harness full-stack re-fires queued for Wednesday.

Peer (KV) Architectures tested fraQtl partial-stack V-only 4b result
KIVI-2 (per-token K 2-bit) Mistral GQA-4 / Qwen GQA-2 / Llama 3.1 GQA-8 @128K × 2 protocols / Phi-3 MHA 5/5 catastrophic margins (fraQtl V-only ≥73% vs KIVI-2 ≤38%)
KIVI-4 (per-token K 4-bit) Mistral / Qwen / Llama 3.1 @128K × 2 / Phi-3 MHA 4/5 non-losses — Mistral + Qwen + 2× Llama 128K non-losses; Phi-3 MHA partial-stack 73.3% vs KIVI-4 91.7% — LOSES (full-stack re-fire Wednesday; full-stack V-only already 140× tighter Δ PPL vs KIVI-4 — see §03)
KVQuant-4 Mistral / Qwen / Phi-3 MHA 2/3 ties or wins on GQA (Mistral 94.4>93.3 / Qwen 99.4>97.8 / Qwen sink0 100). Phi-3 MHA partial-stack 73.3% vs KVQuant-4 93.3% — LOSES (277× Δ PPL gap); full-stack V-only flips it at 15.5× tighter Δ PPL, V+K ties NIAH at 95.0% — see §03
PyramidKV 0.7 Mistral / Qwen 2/2 wins (+8.3 pp / +30.0 pp aggregate over PyramidKV)
SnapKV / H2O / StreamingLLM / TOVA / ExpectedAttention (C44 original, Mistral only) Mistral 7B Instruct fraQtl 100% · TOVA 97.8% (near-tie) · SnapKV 94.1% · ExpectedAttention 53.3% · StreamingLLM 37.8% · H2O 22.0% — 3 of 5 competitors fail catastrophically (<54%)

Source for eviction-peer row: docs/MLP-QUANTIZATION-C14-C17-RESULTS.md L1916–1982. Original "fraQtl is the ONLY method" framing retracted per source L1978.

06Memory lane — runtime GPU memory

Distinct from the weight-compression-ratio matrix above. Runtime GPU memory story on Mistral 7B at 32K context.

Artifact State Measurement
FP16 baseline @ 32K measured 13.91 GB weights · 23.12 GB peak inference · 56.88 GB headroom (A100-80GB)
fraQtl-packed (current loader) @ 32K measured 9.84 GB on disk (30% smaller) · 14.03 GB weights in memory (loader dequantizes to FP16 by design) · coherent generation · 20.4 tok/s
PackedLinear scaffold (runtime-packed) projected · sanity-passed 3.2× per-MLP-layer with shared eigenbasis → ~3.5 GB weights for full Mistral 7B (vs 14 GB FP16). Numerical sanity: 0.04% mean-relative error on fp16 matmul vs nn.Linear reconstruction
Full end-to-end measured demo coming · Wednesday 2026-04-22 Mistral 7B at ~3.5 GB weights + 32K context · actual nvidia-smi trace
Current state (2026-04-19): disk compression is real and shipped (9.84 GB artifact, coherent generation verified). Runtime GPU memory savings are projected from a passing PackedLinear scaffold and will be measured end-to-end by Wednesday. The existing narrative ("Mistral 7B compressed") is honest today; "Mistral 7B runs in 3.5 GB of GPU memory" requires the Wednesday run before going public.

07Boundaries — what's NOT in this matrix

Hold-the-line
  • GPTQ on Phi-3: not attempted this session. No implicit claim.
  • AWQ 3/5-bit: AutoAWQ 0.2.9 is 4-bit only; multi-bit pending llm-compressor scoped image (C50D-AWQ-MULTI-BIT.md).
  • KV cache compression: different substrate. See C44B-KIVI-KVQUANT-BAKEOFF.md, C44E-PYRAMIDKV-BAKEOFF.md.
  • MoE matched-protocol: Qwen 3.6 35B-A3B has KV-cache numbers; weight-compression matched-bits vs AWQ/GPTQ on MoE is next-session C50d work.
  • Throughput / latency / memory: infra agent lane, see C51-THROUGHPUT-BAKEOFF.md.
  • Multi-seed beyond Mistral: per EVAL-PROTOCOL-LOCKED ratio rule, 1-seed is acceptable for ratio comparisons above 1.5× threshold. All reported ratios exceed 1.5×.

08Artifacts + commit hashes

Every number in this matrix traces to a committed script + raw JSON on the fraqtl-hf-cache Modal volume.

Scripts, raw JSONs, commit hashes · per row
Row Script Raw JSON (fraqtl-hf-cache:fraqtl-results/) Commit hash(es)
Mistral 3-seed (fraQtl, bnb, AWQ)C60_golden_mistral_instruct.pyc60_golden_mistralai_Mistral-7B-Instruct-v02_seed{42,123,2024}.json1209261, 9038743, d3af566
Mistral GPTQ 1-seedC64_gptq_pinned.py via modal_run_gptq.pyc64_gptq_pinned_mistralai_Mistral-7B-Instruct-v02_seed42.json17b563f
Llama 3B golden (fraQtl, bnb, AWQ) 1-seedC60_golden_mistral_instruct.py (env MODEL=…)c60_golden_meta-llama_Llama-32-3B-Instruct_seed42.json1209261
Llama 3B GPTQ BROKEN 1-seedC64_gptq_pinned.py (env MODEL=…)c64_gptq_pinned_meta-llama_Llama-32-3B-Instruct_seed42.jsoncc24b1d
Qwen 3B golden (fraQtl, bnb) 1-seedC60_golden_mistral_instruct.py (env MODEL=…)c60_golden_Qwen_Qwen25-3B-Instruct_seed42.json1209261
Qwen 3B AWQ (Catcher-patched) 1-seedC60_qwen_awq_patched.pyc60_qwen_awq_patched_seed42.json19bea7f
Qwen 3B GPTQ 1-seedC64_gptq_pinned.py (env MODEL=…)c64_gptq_pinned_Qwen_Qwen25-3B-Instruct_seed42.jsoncc24b1d
Phi-3 MHA golden (fraQtl, bnb) 1-seedC63_phi3_mha_golden.pyc63_phi3_mha_golden_seed42.jsoncc24b1d
C44b KIVI/KVQuant 1080-cell (Mistral + Qwen)C44b_kivi_kvquant_bakeoff.pyc44b_{mistral,qwen3b}_full_seed42.json1209261
C44e PyramidKV 1080-cell (Mistral + Qwen)C44e_pyramidkv_bakeoff.pyc44e_pyramidkv_{mistral,qwen3b}_seed42.json3b5e8a8, c1388b8
C44d Llama 3.1 8B 128K multi-needle NIAHC44d_llama3_8b_128k_3needle.pyc44d_llama3_8b_128k_3needle_seed42.json (volume)see docs/C44D-MULTI-NEEDLE-128K.md
C44e-shallow Llama 3.1 8B 128K depth sweepC44e_llama3_8b_128k_shallow_depth.pyc44e_llama3_8b_128k_shallow_depth_seed42.json (volume)see docs/C44E-SHALLOW-DEPTH-128K.md
C62 Phi-3 MHA full-stack KV (V-only + V+K INT3+sign, 1-seed)notebooks/benchmarks/C62_phi3_fullstack.pyc62_phi3_mha_seed42.json (8 configs, partial + full-stack side-by-side)3a0bff7 · 527afad (matrix mirror)
Memory @ 32K (FP16 + fraQtl-packed)fraqtl/docs/MEASURED-MEMORY-32K.mdnvidia-smi trace @ A100-80GBf5db558
PackedLinear scaffold + sanity testfraqtl/src/fraqtl/packed_linear.py · experiments/packed_linear_sanity.py0.04% mean-rel error vs nn.Linear19e27c3

One compression principle. Multiple architectures.

Try the public Qwen 3.6 35B-A3B compressed artifact, or pilot fraQtl on your own model stack.

Hugging Face Request Pilot KV Cache Explainer
SHOMER-VERIFIED All numbers verified against commit 527afad of docs/FRAQTL-ARCHITECTURE-AGNOSTIC-MATRIX.md (Phi-3 MHA full-stack mirror; source data commit 3a0bff7) and docs/SHOMER-NUMBERS.md.
Protocol: docs/EVAL-PROTOCOL-LOCKED.md · golden_eval v1 · WikiText-2 test 64×512 · 256+256 prefix+continuation.
Reproduce: github.com/fraqtl · HF: huggingface.co/fraqtl · Questions: samuel@fraqtl.ai