Llama-4-Maverick-17B-128E-Instruct – MINT GGUF

Q8_0 (8.5 bits/weight) | 397 GB | Ready for Ollama, llama.cpp, LM Studio

Quick Start

Important: You must use the included Modelfile when running with Ollama. It contains the chat template and stop tokens needed for proper generation. Without it, the model will repeat output endlessly.

# Ollama: download the GGUF and Modelfile, then create:
ollama create llama4-maverick -f Modelfile

# Run:
ollama run llama4-maverick

# llama.cpp (stop tokens handled automatically)
llama-cli -m Llama-4-Maverick-Q8_0.gguf -p "Hello" -ngl 99

Model Details

| Spec | Value |
|---|---|
| Base model | meta-llama/Llama-4-Maverick-17B-128E-Instruct |
| Parameters | 402B total, 17B active (MoE, 128 experts) |
| Architecture | Llama 4 MoE (24 dense + 24 MoE layers) |
| Quantization | Q8_0 (8.5 BPW) |
| Size | 397 GB |
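The listed size can be roughly sanity-checked from the parameter count and bits-per-weight above. This is a back-of-envelope estimate, assuming the listed size is in GiB and ignoring GGUF metadata overhead:

```python
# Rough size estimate from the Model Details table above.
total_params = 402e9   # 402B total parameters
bpw = 8.5              # Q8_0 effective bits per weight

size_bytes = total_params * bpw / 8
size_gib = size_bytes / 2**30
print(f"~{size_gib:.0f} GiB")   # roughly matches the listed 397 GB
```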

Quality vs Size Analysis

This model sits at the knee of the diminishing-returns curve: the best quality-per-GB tradeoff. Beyond ~407 GB, each additional GB yields minimal quality improvement.

[Figure: quality vs. size curve]

Modelfile

A Modelfile with the proper Llama 4 chat template and stop tokens (<|eot|>, <|eom|>) is included. Its key parameters (the TEMPLATE directive is omitted below for brevity):

FROM Llama-4-Maverick-Q8_0.gguf
PARAMETER num_ctx 32768
PARAMETER temperature 0.6
PARAMETER stop "<|eot|>"
PARAMETER stop "<|eom|>"

About MINT

MINT (Memory-Informed N-bit Tuning) is a data-free, per-tensor mixed-precision quantization method.

Paper & Code: github.com/baa-ai/MINT
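To give a feel for what a per-tensor mixed-precision allocator does, the sketch below treats bit allocation as a multiple-choice knapsack: each tensor picks exactly one bit-width, and a greedy pass upgrades whichever tensor buys the most loss reduction per GB. This is a toy illustration, not the MINT implementation; the tensor names, sizes, and loss numbers are invented.

```python
# Toy multiple-choice-knapsack allocation sketch (not the MINT code).
# Each tensor offers (size_gb, est_loss) per candidate bit-width.
tensors = {
    "attn":  {4: (1.0, 0.30), 8: (2.0, 0.05)},
    "ffn":   {4: (4.0, 0.10), 8: (8.0, 0.02)},
    "embed": {4: (0.5, 0.50), 8: (1.0, 0.08)},
}
budget_gb = 9.0

# Start every tensor at its smallest bit-width.
choice = {name: min(opts) for name, opts in tensors.items()}

def totals(ch):
    size = sum(tensors[n][b][0] for n, b in ch.items())
    loss = sum(tensors[n][b][1] for n, b in ch.items())
    return size, loss

# Greedily upgrade the tensor with the best loss-reduction-per-GB
# until no upgrade fits in the budget.
while True:
    size, _ = totals(choice)
    best_upgrade = None
    for name, opts in tensors.items():
        cur = choice[name]
        for bits in opts:
            if bits <= cur:
                continue
            dsize = opts[bits][0] - opts[cur][0]
            dloss = opts[cur][1] - opts[bits][1]
            if size + dsize <= budget_gb and dloss > 0:
                ratio = dloss / dsize
                if best_upgrade is None or ratio > best_upgrade[0]:
                    best_upgrade = (ratio, name, bits)
    if best_upgrade is None:
        break
    choice[best_upgrade[1]] = best_upgrade[2]

size, loss = totals(choice)
print(choice, f"size={size} GB loss={loss:.2f}")
```

With these made-up numbers the allocator keeps the large FFN tensors at 4-bit and spends the budget on the smaller, more loss-sensitive tensors, which is the general shape of mixed-precision results.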

Rate-Distortion Curve


Quality vs. size trade-off from the MINT MCKP allocator. ★ = optimal knee point.

| Budget | Size | Avg Bits | Loss |
|---|---|---|---|
| 167 GB | 166.7 GB | 3.1 | 125.0816 |
| 200 GB | 200.0 GB | 3.9 | 6.9136 ★ |
| 233 GB | 233.3 GB | 4.6 | 4.2019 |
| 267 GB | 266.6 GB | 5.2 | 2.5834 |
| 300 GB | 299.9 GB | 5.9 | 1.6647 |
| 333 GB | 333.3 GB | 6.7 | 0.8914 |
| 367 GB | 366.6 GB | 7.4 | 0.5404 |
| 400 GB | 399.9 GB | 8.3 | 0.2990 |
| 433 GB | 429.3 GB | 8.7 | 0.2404 |
| 467 GB | 462.1 GB | 9.5 | 0.2030 |
| 500 GB | 499.6 GB | 10.3 | 0.1607 |
| 533 GB | 532.4 GB | 11.1 | 0.1237 |
| 567 GB | 560.5 GB | 11.7 | 0.1066 |
| 600 GB | 598.0 GB | 12.6 | 0.0851 |
| 633 GB | 626.1 GB | 13.2 | 0.0691 |
| 667 GB | 663.6 GB | 14.1 | 0.0478 |
| 700 GB | 691.8 GB | 14.7 | 0.0318 |
| 733 GB | 729.3 GB | 15.6 | 0.0106 |
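The knee point can be read directly off the table: the first budget step buys far more loss reduction per GB than any later step. A quick check, with the size and loss columns copied from the table above:

```python
# Marginal loss reduction per GB across the MINT rate-distortion table.
sizes = [166.7, 200.0, 233.3, 266.6, 299.9, 333.3, 366.6, 399.9,
         429.3, 462.1, 499.6, 532.4, 560.5, 598.0, 626.1, 663.6,
         691.8, 729.3]
loss  = [125.0816, 6.9136, 4.2019, 2.5834, 1.6647, 0.8914, 0.5404,
         0.2990, 0.2404, 0.2030, 0.1607, 0.1237, 0.1066, 0.0851,
         0.0691, 0.0478, 0.0318, 0.0106]

# Loss reduction bought per extra GB at each step up the budget ladder.
gain_per_gb = [(loss[i] - loss[i + 1]) / (sizes[i + 1] - sizes[i])
               for i in range(len(sizes) - 1)]

# The 167 -> 200 GB step buys ~3.5 loss per GB; every later step buys
# under 0.1 loss per GB, which is why 200 GB is marked as the knee.
best = max(range(len(gain_per_gb)), key=lambda i: gain_per_gb[i])
print(f"steepest step ends at {sizes[best + 1]} GB "
      f"({gain_per_gb[best]:.3f} loss reduction per GB)")
```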

Generated by MINT rate-distortion optimization.

