Llama-4-Maverick-17B-128E-Instruct – MINT GGUF

Q8_0 (8.5 bits/weight) | 397 GB | Ready for Ollama, llama.cpp, LM Studio

Quick Start

Important: You must use the included Modelfile when running with Ollama. It contains the chat template and stop tokens needed for proper generation. Without it, the model will repeat output endlessly.

# Ollama: download the GGUF and Modelfile, then create:
ollama create llama4-maverick -f Modelfile

# Run:
ollama run llama4-maverick

# llama.cpp (stop tokens handled automatically)
llama-cli -m Llama-4-Maverick-Q8_0.gguf -p "Hello" -ngl 99

Model Details

| Spec | Value |
|---|---|
| Base model | meta-llama/Llama-4-Maverick-17B-128E-Instruct |
| Parameters | 402B total, 17B active (MoE, 128 experts) |
| Architecture | Llama 4 MoE (24 dense + 24 MoE layers) |
| Quantization | Q8_0 (8.5 BPW) |
| Size | 397 GB |
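The listed size can be roughly sanity-checked from the parameter count and bits-per-weight above. This is a back-of-envelope estimate, assuming the listed size is in GiB and ignoring GGUF metadata overhead:

```python
# Rough size estimate from the Model Details table above.
total_params = 402e9   # 402B total parameters
bpw = 8.5              # Q8_0 effective bits per weight

size_bytes = total_params * bpw / 8
size_gib = size_bytes / 2**30
print(f"~{size_gib:.0f} GiB")   # roughly matches the listed 397 GB
```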

Quality vs Size Analysis

This model sits at the knee of the diminishing-returns curve: the best quality-per-GB tradeoff. Beyond ~407 GB, each additional GB yields minimal quality improvement.

[Figure: quality vs. size curve]

Modelfile

A Modelfile with the proper Llama 4 chat template and stop tokens (<|eot|>, <|eom|>) is included. Its key parameters (the TEMPLATE directive is omitted below for brevity):

FROM Llama-4-Maverick-Q8_0.gguf
PARAMETER num_ctx 32768
PARAMETER temperature 0.6
PARAMETER stop "<|eot|>"
PARAMETER stop "<|eom|>"

About MINT

MINT (Memory-Informed N-bit Tuning) is a data-free, per-tensor mixed-precision quantization method.

Paper & Code: github.com/baa-ai/MINT
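To give a feel for what a per-tensor mixed-precision allocator does, the sketch below treats bit allocation as a multiple-choice knapsack: each tensor picks exactly one bit-width, and a greedy pass upgrades whichever tensor buys the most loss reduction per GB. This is a toy illustration, not the MINT implementation; the tensor names, sizes, and loss numbers are invented.

```python
# Toy multiple-choice-knapsack allocation sketch (not the MINT code).
# Each tensor offers (size_gb, est_loss) per candidate bit-width.
tensors = {
    "attn":  {4: (1.0, 0.30), 8: (2.0, 0.05)},
    "ffn":   {4: (4.0, 0.10), 8: (8.0, 0.02)},
    "embed": {4: (0.5, 0.50), 8: (1.0, 0.08)},
}
budget_gb = 9.0

# Start every tensor at its smallest bit-width.
choice = {name: min(opts) for name, opts in tensors.items()}

def totals(ch):
    size = sum(tensors[n][b][0] for n, b in ch.items())
    loss = sum(tensors[n][b][1] for n, b in ch.items())
    return size, loss

# Greedily upgrade the tensor with the best loss-reduction-per-GB
# until no upgrade fits in the budget.
while True:
    size, _ = totals(choice)
    best_upgrade = None
    for name, opts in tensors.items():
        cur = choice[name]
        for bits in opts:
            if bits <= cur:
                continue
            dsize = opts[bits][0] - opts[cur][0]
            dloss = opts[cur][1] - opts[bits][1]
            if size + dsize <= budget_gb and dloss > 0:
                ratio = dloss / dsize
                if best_upgrade is None or ratio > best_upgrade[0]:
                    best_upgrade = (ratio, name, bits)
    if best_upgrade is None:
        break
    choice[best_upgrade[1]] = best_upgrade[2]

size, loss = totals(choice)
print(choice, f"size={size} GB loss={loss:.2f}")
```

With these made-up numbers the allocator keeps the large FFN tensors at 4-bit and spends the budget on the smaller, more loss-sensitive tensors, which is the general shape of mixed-precision results.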

Rate-Distortion Curve


Quality vs. size trade-off from the MINT MCKP allocator. ★ = optimal knee point.

| Budget | Size | Avg Bits | Loss |
|---|---|---|---|
| 167 GB | 166.7 GB | 3.1 | 125.0816 |
| 200 GB | 200.0 GB | 3.9 | 6.9136 ★ |
| 233 GB | 233.3 GB | 4.6 | 4.2019 |
| 267 GB | 266.6 GB | 5.2 | 2.5834 |
| 300 GB | 299.9 GB | 5.9 | 1.6647 |
| 333 GB | 333.3 GB | 6.7 | 0.8914 |
| 367 GB | 366.6 GB | 7.4 | 0.5404 |
| 400 GB | 399.9 GB | 8.3 | 0.2990 |
| 433 GB | 429.3 GB | 8.7 | 0.2404 |
| 467 GB | 462.1 GB | 9.5 | 0.2030 |
| 500 GB | 499.6 GB | 10.3 | 0.1607 |
| 533 GB | 532.4 GB | 11.1 | 0.1237 |
| 567 GB | 560.5 GB | 11.7 | 0.1066 |
| 600 GB | 598.0 GB | 12.6 | 0.0851 |
| 633 GB | 626.1 GB | 13.2 | 0.0691 |
| 667 GB | 663.6 GB | 14.1 | 0.0478 |
| 700 GB | 691.8 GB | 14.7 | 0.0318 |
| 733 GB | 729.3 GB | 15.6 | 0.0106 |
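The knee point can be read directly off the table: the first budget step buys far more loss reduction per GB than any later step. A quick check, with the size and loss columns copied from the table above:

```python
# Marginal loss reduction per GB across the MINT rate-distortion table.
sizes = [166.7, 200.0, 233.3, 266.6, 299.9, 333.3, 366.6, 399.9,
         429.3, 462.1, 499.6, 532.4, 560.5, 598.0, 626.1, 663.6,
         691.8, 729.3]
loss  = [125.0816, 6.9136, 4.2019, 2.5834, 1.6647, 0.8914, 0.5404,
         0.2990, 0.2404, 0.2030, 0.1607, 0.1237, 0.1066, 0.0851,
         0.0691, 0.0478, 0.0318, 0.0106]

# Loss reduction bought per extra GB at each step up the budget ladder.
gain_per_gb = [(loss[i] - loss[i + 1]) / (sizes[i + 1] - sizes[i])
               for i in range(len(sizes) - 1)]

# The 167 -> 200 GB step buys ~3.5 loss per GB; every later step buys
# under 0.1 loss per GB, which is why 200 GB is marked as the knee.
best = max(range(len(gain_per_gb)), key=lambda i: gain_per_gb[i])
print(f"steepest step ends at {sizes[best + 1]} GB "
      f"({gain_per_gb[best]:.3f} loss reduction per GB)")
```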

Generated by MINT rate-distortion optimization.

