BENCHMARKS · PUBLIC ARTIFACTS · WEIGHT MATRIX

Receipts.

Two surfaces: the public Qwen 3.6 35B-A3B compressed artifact (MMLU + ∞Bench + VRAM) and the matched-4-bit weight matrix across MHA, GQA-2, GQA-3, and GQA-4. Every number reproducible against SHOMER-NUMBERS and golden_eval v1.

Public artifact: Qwen 3.6 35B-A3B compressed runs 128K context on 1× A100-80GB at 51.7 GB VRAM, MMLU 82.24% (FP16: 82.40%), ∞Bench passkey 30/30 at 125,315 tokens.
Weight matrix: matched-4-bit results across 4 architecture families using the same core recipe (K_{GATE_UP} = K_DOWN = 256, INT3 + sign correction), with validation reported per model. GOLDEN_EVAL_V1 · WikiText-2 64×512 · 256+256 prefix+continuation · PPL + KL(FP16 ‖ compressed)

00Public artifact · Qwen 3.6 35B-A3B

Live on Hugging Face. Loads through standard Transformers workflow. 23.8 GB on disk. MoE: 35B params, 3B active.

Metric	fraQtl compressed	FP16 reference	Δ	Notes
MMLU (5-shot)	82.24%	82.40%	−0.16 pp	14,042 questions, 57 subjects
∞Bench passkey	30 / 30	30 / 30	tied	at 125,315 tokens
VRAM @ 16K context	25.6 GB	~71 GB	−45 GB	both fit on 1× A100-80GB
VRAM @ 64K context	36.8 GB	82.86 GB → OOM	FP16 needs 2× A100	measured OOM on 80 GB ceiling
VRAM @ 128K context	51.7 GB	85+ GB	FP16 needs 2× A100	28 GB headroom on 1× A100
Disk size	23.8 GB	~70 GB	~3× smaller	safetensors, full artifact

VRAM at 16K and 128K for FP16 are model + KV-cache extrapolations — conservative lower bounds. 64K FP16 = 82.86 GB measured (OOMs on 80 GB ceiling).

Weight-compressed artifact. Runtime KV-cache compression is a separate early-access layer and is not stacked on top of this artifact in these numbers.

Reproducible: huggingface.co/fraQtl/Qwen3.6-35B-A3B-compressed · install via pip install fraqtl-runtime

01The weight matrix

All Δ PPL and KL are against each row's own FP16 baseline (same eval set). Single-seed unless marked. 3-seed Mistral row is mean ± std.

Model / Architecture	FP16 PPL	bnb NF4 Δ / KL / b/w	AWQ 4-bit Δ / KL / b/w	GPTQ 4-bit Δ / KL / b/w	fraQtl INT3+sign Δ / KL / b/w
Mistral-7B-Instruct-v0.2 (GQA-4)	6.6068	+0.1430 / 0.0274 / 4.50 (3-seed)	+0.1590 ±0.0001 / 0.0293 / 4.12 (3-seed)	+0.1721 / 0.0436 / 4.12	+0.0504 ±0.0108 / 0.0165 / 3.62 (3-seed)
Llama-3.2-3B-Instruct (GQA-3)	12.3720	+0.7445 / 0.0644 / 4.50	+0.8015 / 0.0605 / 4.12	BROKEN ¹	+0.4279 / 0.0254 / 3.86
Qwen2.5-3B-Instruct (GQA-2)	8.3597	+0.5091 / 0.0843 / 4.50	+0.5865 / 0.0775 / 4.12 ²	+0.5945 / 0.0991 / 4.12	+0.2241 / 0.0362 / 4.18
Phi-3-mini-4k-instruct (TRUE MHA)	6.5048	+0.5873 / 0.0965 / 4.50	FAILED ³	n/a ⁴	+0.2061 / 0.0466 / 3.86

¹ gptqmodel 1.9.0 predates Llama 3 (released 2024-04): library-version failure, not a GPTQ-method failure. Newer gptqmodel 2.x has its own PyPI-drift instability (blocked original C61). Excluded from ratio/scoreboard math.

² AWQ on Qwen2 needed a nn.Module.__getattr__ monkey-patch inside awq.quantize() to forward Catcher's missing attention_type attribute. Script: notebooks/benchmarks/C60_qwen_awq_patched.py.

³ AWQ on Phi-3 hits a DIFFERENT AutoAWQ bug (KeyError: 'type') than Qwen2's. Not chased per sunk-cost rule; deferred to llm-compressor sprint. Script: C63_phi3_mha_golden.py.

⁴ GPTQ not attempted on Phi-3 this session (focus: MHA universality via bnb/fraQtl).

fraQtl ratios vs each peer

Ratios are stable across architectures. The MHA result (Phi-3, 2.85× vs bnb) lands in the same band as the GQA-4 result (Mistral, 2.84× vs bnb).

Model	Δ vs bnb NF4	Δ vs AWQ 4-bit	Δ vs GPTQ 4-bit	KL vs bnb	KL vs AWQ	KL vs GPTQ
Mistral 7B Instruct	2.84× tighter	3.16× tighter	3.41× tighter	1.66×	1.78×	2.64×
Llama 3.2 3B Instruct	1.74× tighter	1.87× tighter	n/a (GPTQ broken)	2.54×	2.39×	n/a
Qwen 2.5 3B Instruct	2.27× tighter	2.62× tighter	2.65× tighter	2.33×	2.14×	2.74×
Phi-3-mini-4k (MHA)	2.85× tighter	n/a (AWQ failed)	n/a (not attempted)	2.07×	n/a	n/a

Scoreboard — honest counts

BROKEN peer results (Llama GPTQ) and library failures (AWQ on Phi-3) are NOT counted as fraQtl wins — they're counted as "peer didn't run."

Peer	Attempted	Usable	fraQtl wins (matched)
bnb NF4	4/4	4/4	4/4
AWQ 4-bit	4/4	3/4 (Qwen needed patch, Phi-3 failed)	3/3
GPTQ 4-bit	3/4 (Phi-3 not attempted)	2/3 (Llama broken)	2/2

Total: 9/9 matched-bits wins where the peer produced a usable number, across 4 architectures. Same core recipe with per-model validation; no architecture-specific algorithm changes.

02fraQtl config key

fraQtl has TWO configs in the KV lane. Don't conflate them. Read before the KV tables below.

PARTIAL STACK · EXPERIMENT-GRADE

fraQtl V-only 4b

Eigenbasis only + INT4 uniform quantization. No sign correction.

C44b (1080-cell NIAH) · C44c sanity · C44d (128K multi-needle) · C44e (128K depth-sweep)

FULL STACK · PAPER-GRADE · "NANO-BITS"

fraQtl V-only INT3 + sign

Eigenbasis + k=16 FP16 protect + LM-INT3 on sacrifice dims + sign correction. 3-seed-validated, pinned-eval.

C38a cross-arch 3-seed · C25q Qwen 3.6 35B-A3B MoE

BD rule: cold-email headline numbers use full stack (INT3+sign) from C38a. Long-context KIVI-2 collapse stories use partial stack (V-only 4b) from C44 family — still valid, just honestly labeled.

03KV cache substrate

The matrix above compresses weights. fraQtl's V-theorem + sign-correction mechanism also applies to KV cache (runtime-dynamic tensor). Cross-architecture KV results follow — per-needle NIAH and 3-seed PPL/KL where measured.

Cross-arch NIAH at matched 4-bit vs KIVI / KVQuant / eviction peers

Config = fraQtl V-only 4b (partial stack, eigenbasis + INT4 uniform, no sign correction). Full-stack INT3+sign numbers in the PPL/KL table below.

Model / Arch	Protocol	fraQtl V-only 4b	fraQtl V+K 4b	KIVI-4	KVQuant-4	KIVI-2	PyramidKV 0.7
Mistral 7B Instruct (GQA-4)	C44b · 1080-cell NIAH 4K–31K	94.4%	93.3%	93.3%	93.3%	37.8%	86.1%
Qwen 2.5 3B Instruct (GQA-2)	C44b · 1080-cell NIAH 4K–31K	99.4%	79.4% ⁵	98.9%	97.8% / 100% (sink0)	1.7%	69.4%
Llama 3.1 8B Instruct (GQA-8)	C44d · 128K multi-needle	93.3%	n/a	93.3%	n/a	0.0%	n/a
Llama 3.1 8B Instruct (GQA-8)	C44e · 128K depth-sweep ⁶	100%	n/a	100%	n/a	0.0%	n/a

⁵ fraQtl V+K on Qwen 3B GQA-2 used Mistral's blind k_protect — per-model calibration pending (see EVAL-PROTOCOL-LOCKED). Not a fraQtl limitation claim.

⁶ notebooks/benchmarks/C44e_llama3_8b_128k_shallow_depth.py (separate from C44e_pyramidkv_bakeoff.py).

Full-stack C44-harness re-fires queued for Wednesday — benchmark-agent ticket: run C44b/C44d/C44e at fraQtl V-only INT3+sign config to widen the matched-bits margin. Partial-stack already ties KIVI-4; full-stack expected to extend to sub-4bit Pareto.

Cross-arch KV PPL/KL 3-seed pinned

The 3-seed rows bolded below are the paper-grade / BD-headline numbers. Full-stack = INT3 + sign correction.

Model / Arch	Config	Δ PPL (3-seed)	KL (3-seed)	NIAH (3-seed)	Source
Partial stack · C25 family
Mistral 7B Instruct (GQA-4)	V-only k=16 LM-INT3	+0.027 ±0.005	0.00381 ±0.00001	—	SHOMER C25
Mistral 7B Instruct (GQA-4)	V+K k=16 LM-INT3	+0.062 ±0.001	0.00751 ±0.00013	—	SHOMER C25
Full stack · INT3 + sign · C38a (paper-grade)
Mistral 7B Instruct (GQA-4)	V-only k=16 INT3+sign	+0.0015 ±0.0044	0.00317	—	SHOMER C38a v2
Mistral 7B Instruct (GQA-4)	K k=16 INT3+sign	+0.0070 ±0.0043	—	—	SHOMER C38a
Llama 3.2 3B Instruct (GQA-3)	V k=8 INT3+sign	+0.0181 ±0.015	0.00308	99.4% (179/180)	SHOMER C38a
Llama 3.2 3B Instruct (GQA-3)	K k=16 INT3+sign	+0.0221 ±0.0079	0.00312	100% (180/180)	SHOMER C38a
Qwen 2.5 3B Instruct (GQA-2)	V k=8 INT3+sign	+0.0542 ±0.0116	0.00362	51.7% ⁷	SHOMER C38a
Phi-3-mini-128k-instruct (TRUE MHA)	V-only k=16 INT3+sign ⁹	+0.0002 (1-seed)	0.00122	93.3% (56/60)	C62 2026-04-21
Phi-3-mini-128k-instruct (TRUE MHA)	V+K k=16 INT3+sign ⁹	+0.0073 (1-seed)	0.00383	95.0% (57/60 — ties FP16)	C62 2026-04-21
Partial stack · MoE hybrid attention · C25q
Qwen 3.6 35B-A3B (MoE hybrid)	V-only k=16 INT3	+0.045 ±0.011	0.0183 ±0.0023	—	SHOMER C25q
Qwen 3.6 35B-A3B (MoE hybrid)	V+K k=16 INT3	+0.166 ±0.049	0.0221 ±0.0015	—	SHOMER C25q

⁷ Qwen 2.5 3B V NIAH: 3-seed mean 51.7%, FP16 baseline 58.3% — small-model short-context NIAH has low ceiling. KLD is 1.5× tighter than INT4 uniform; honest mixed result on GQA-2 V cache, not a clean NIAH win. The sign-correction paradigm's KLD advantage is the cross-arch invariant; PPL/NIAH narrow on GQA-2.

⁹ Phi-3 MHA rows are 1-seed from C62 patched run (2026-04-21, source commit 3a0bff7). Per EVAL-PROTOCOL-LOCKED.md ratio rule, 1-seed is acceptable for RATIO comparisons where ratio > 1.5× — these rows satisfy that vs KIVI-4 (140× tighter Δ PPL) and KVQuant-4 (15.5× tighter Δ PPL). For ABSOLUTE-delta public citation, a 3-seed re-run is queued (~30 min × 2 additional seeds on A100-40GB). Raw JSON verified against live run output.

Phi-3 MHA full-stack lands the cross-arch story. Partial-stack on MHA KV previously lost to KIVI-4 (73.3% vs 91.7%); full-stack V-only INT3+sign flips it — +0.0002 Δ PPL (140× tighter than KIVI-4), 93.3% NIAH. V+K ties FP16 at 95.0%. Full-stack C44-harness re-fires for partial-stack margin extension staged for Wednesday.

Gap (queued, P0): 3-seed re-run of Phi-3 MHA V + K (1-seed → 3-seed) for absolute-delta public citation. Ratio-rule-compliant today; 3-seed closes the absolute-delta path. Also: K-cache 3-seed INT3+sign on Qwen 2.5 3B (GQA-2).

04KIVI-2 catastrophic-collapse signature

Combined signal from C44b + C44c sanity + C44d + C44e shallow-depth. Every context length. Every needle type. Every depth position.

Context	Architecture	Grid	KIVI-2 retention
4K–31K	Mistral 7B Instruct (GQA-4)	1080 cells · 3 needles × 4 ctx × 5 depths × 3 trials	37.8%
4K–31K	Qwen 2.5 3B Instruct (GQA-2)	1080 cells	1.7%
128K	Llama 3.1 8B Instruct (GQA-8)	15 cells · 3 needles × 5 trials @ depth 50 (C44d)	0.0%
128K	Llama 3.1 8B Instruct (GQA-8)	9 cells · technical_password × 3 depths × 3 trials (C44e)	0.0%

Pattern: KIVI-2's per-token K quantization collapses at EVERY context length tested, across EVERY needle type, across EVERY depth position. fraQtl V-only 4-bit ties or beats KIVI-4 everywhere. fraQtl 2-bit regimes via V-only k=16 INT3 sit at 3.5× (different Pareto point from KIVI-2's 8×).

05KV substrate — scoreboard

fraQtl V-only 4b (partial stack) vs peer KV families. Full-stack INT3+sign numbers in §03 extend the margin further; C44-harness full-stack re-fires queued for Wednesday.

Peer (KV)	Architectures tested	fraQtl partial-stack V-only 4b result
KIVI-2 (per-token K 2-bit)	Mistral GQA-4 / Qwen GQA-2 / Llama 3.1 GQA-8 @128K × 2 protocols / Phi-3 MHA	5/5 catastrophic margins (fraQtl V-only ≥73% vs KIVI-2 ≤38%)
KIVI-4 (per-token K 4-bit)	Mistral / Qwen / Llama 3.1 @128K × 2 / Phi-3 MHA	4/5 non-losses — Mistral + Qwen + 2× Llama 128K non-losses; Phi-3 MHA partial-stack 73.3% vs KIVI-4 91.7% — LOSES (full-stack re-fire Wednesday; full-stack V-only already 140× tighter Δ PPL vs KIVI-4 — see §03)
KVQuant-4	Mistral / Qwen / Phi-3 MHA	2/3 ties or wins on GQA (Mistral 94.4>93.3 / Qwen 99.4>97.8 / Qwen sink0 100). Phi-3 MHA partial-stack 73.3% vs KVQuant-4 93.3% — LOSES (277× Δ PPL gap); full-stack V-only flips it at 15.5× tighter Δ PPL, V+K ties NIAH at 95.0% — see §03
PyramidKV 0.7	Mistral / Qwen	2/2 wins (+8.3 pp / +30.0 pp aggregate over PyramidKV)
SnapKV / H2O / StreamingLLM / TOVA / ExpectedAttention (C44 original, Mistral only)	Mistral 7B Instruct	fraQtl 100% · TOVA 97.8% (near-tie) · SnapKV 94.1% · ExpectedAttention 53.3% · StreamingLLM 37.8% · H2O 22.0% — 3 of 5 competitors fail catastrophically (<54%)

Source for eviction-peer row: docs/MLP-QUANTIZATION-C14-C17-RESULTS.md L1916–1982. Original "fraQtl is the ONLY method" framing retracted per source L1978.

06Memory lane — runtime GPU memory

Distinct from the weight-compression-ratio matrix above. Runtime GPU memory story on Mistral 7B at 32K context.

Artifact	State	Measurement
FP16 baseline @ 32K	measured	13.91 GB weights · 23.12 GB peak inference · 56.88 GB headroom (A100-80GB)
fraQtl-packed (current loader) @ 32K	measured	9.84 GB on disk (30% smaller) · 14.03 GB weights in memory (loader dequantizes to FP16 by design) · coherent generation · 20.4 tok/s
`PackedLinear` scaffold (runtime-packed)	projected · sanity-passed	3.2× per-MLP-layer with shared eigenbasis → ~3.5 GB weights for full Mistral 7B (vs 14 GB FP16). Numerical sanity: 0.04% mean-relative error on fp16 matmul vs nn.Linear reconstruction
Full end-to-end measured demo	coming · Wednesday 2026-04-22	Mistral 7B at ~3.5 GB weights + 32K context · actual nvidia-smi trace

Current state (2026-04-19): disk compression is real and shipped (9.84 GB artifact, coherent generation verified). Runtime GPU memory savings are projected from a passing PackedLinear scaffold and will be measured end-to-end by Wednesday. The existing narrative ("Mistral 7B compressed") is honest today; "Mistral 7B runs in 3.5 GB of GPU memory" requires the Wednesday run before going public.

07Boundaries — what's NOT in this matrix

Hold-the-line

GPTQ on Phi-3: not attempted this session. No implicit claim.
AWQ 3/5-bit: AutoAWQ 0.2.9 is 4-bit only; multi-bit pending llm-compressor scoped image (C50D-AWQ-MULTI-BIT.md).
KV cache compression: different substrate. See C44B-KIVI-KVQUANT-BAKEOFF.md, C44E-PYRAMIDKV-BAKEOFF.md.
MoE matched-protocol: Qwen 3.6 35B-A3B has KV-cache numbers; weight-compression matched-bits vs AWQ/GPTQ on MoE is next-session C50d work.
Throughput / latency / memory: infra agent lane, see C51-THROUGHPUT-BAKEOFF.md.
Multi-seed beyond Mistral: per EVAL-PROTOCOL-LOCKED ratio rule, 1-seed is acceptable for ratio comparisons above 1.5× threshold. All reported ratios exceed 1.5×.

08Artifacts + commit hashes

Every number in this matrix traces to a committed script + raw JSON on the fraqtl-hf-cache Modal volume.

Scripts, raw JSONs, commit hashes · per row

Row	Script	Raw JSON (fraqtl-hf-cache:fraqtl-results/)	Commit hash(es)
Mistral 3-seed (fraQtl, bnb, AWQ)	C60_golden_mistral_instruct.py	c60_golden_mistralai_Mistral-7B-Instruct-v02_seed{42,123,2024}.json	1209261, 9038743, d3af566
Mistral GPTQ 1-seed	C64_gptq_pinned.py via modal_run_gptq.py	c64_gptq_pinned_mistralai_Mistral-7B-Instruct-v02_seed42.json	17b563f
Llama 3B golden (fraQtl, bnb, AWQ) 1-seed	C60_golden_mistral_instruct.py (env MODEL=…)	c60_golden_meta-llama_Llama-32-3B-Instruct_seed42.json	1209261
Llama 3B GPTQ BROKEN 1-seed	C64_gptq_pinned.py (env MODEL=…)	c64_gptq_pinned_meta-llama_Llama-32-3B-Instruct_seed42.json	cc24b1d
Qwen 3B golden (fraQtl, bnb) 1-seed	C60_golden_mistral_instruct.py (env MODEL=…)	c60_golden_Qwen_Qwen25-3B-Instruct_seed42.json	1209261
Qwen 3B AWQ (Catcher-patched) 1-seed	C60_qwen_awq_patched.py	c60_qwen_awq_patched_seed42.json	19bea7f
Qwen 3B GPTQ 1-seed	C64_gptq_pinned.py (env MODEL=…)	c64_gptq_pinned_Qwen_Qwen25-3B-Instruct_seed42.json	cc24b1d
Phi-3 MHA golden (fraQtl, bnb) 1-seed	C63_phi3_mha_golden.py	c63_phi3_mha_golden_seed42.json	cc24b1d
C44b KIVI/KVQuant 1080-cell (Mistral + Qwen)	C44b_kivi_kvquant_bakeoff.py	c44b_{mistral,qwen3b}_full_seed42.json	1209261
C44e PyramidKV 1080-cell (Mistral + Qwen)	C44e_pyramidkv_bakeoff.py	c44e_pyramidkv_{mistral,qwen3b}_seed42.json	3b5e8a8, c1388b8
C44d Llama 3.1 8B 128K multi-needle NIAH	C44d_llama3_8b_128k_3needle.py	c44d_llama3_8b_128k_3needle_seed42.json (volume)	see docs/C44D-MULTI-NEEDLE-128K.md
C44e-shallow Llama 3.1 8B 128K depth sweep	C44e_llama3_8b_128k_shallow_depth.py	c44e_llama3_8b_128k_shallow_depth_seed42.json (volume)	see docs/C44E-SHALLOW-DEPTH-128K.md
C62 Phi-3 MHA full-stack KV (V-only + V+K INT3+sign, 1-seed)	notebooks/benchmarks/C62_phi3_fullstack.py	c62_phi3_mha_seed42.json (8 configs, partial + full-stack side-by-side)	3a0bff7 · 527afad (matrix mirror)
Memory @ 32K (FP16 + fraQtl-packed)	fraqtl/docs/MEASURED-MEMORY-32K.md	nvidia-smi trace @ A100-80GB	f5db558
PackedLinear scaffold + sanity test	fraqtl/src/fraqtl/packed_linear.py · experiments/packed_linear_sanity.py	0.04% mean-rel error vs nn.Linear	19e27c3

One compression principle. Multiple architectures.

Try the public Qwen 3.6 35B-A3B compressed artifact, or pilot fraQtl on your own model stack.

Hugging Face Request Pilot KV Cache Explainer

SHOMER-VERIFIED All numbers verified against commit 527afad of docs/FRAQTL-ARCHITECTURE-AGNOSTIC-MATRIX.md (Phi-3 MHA full-stack mirror; source data commit 3a0bff7) and docs/SHOMER-NUMBERS.md.
Protocol: docs/EVAL-PROTOCOL-LOCKED.md · golden_eval v1 · WikiText-2 test 64×512 · 256+256 prefix+continuation.
Reproduce: github.com/fraqtl · HF: huggingface.co/fraqtl · Questions: samuel@fraqtl.ai