⚠️ Vision is not yet working — text-only for now.
Image/video input is not functional in this build. The MiniMax-M3 vision tower has an unresolved rotary-embedding bug, so multimodal input is disabled in the serving config. Text generation works; treat this as a text-only coder model until vision is fixed.
MiniMax-M3-Coder-REAP32 (RFI)
A more-aggressively-pruned REAP32 build of MiniMax-M3 — 87 of 128 routed experts kept (32% pruned) — RFI-composite-quantized and packaged to serve on AMD ROCm / RDNA4 (gfx1201) through a custom vLLM fork. Smaller and faster than the 100-expert REAP22 build, with a bundled EAGLE3 drafter for speculative decoding.
⚠️ Experimental — in active development
Functional but experimental: it runs and produces coherent long-context output, but it has not been broadly evaluated, the quant/prune recipe is still being tuned, and behavior may change between revisions. Not for production.
What this is
A derivative of MiniMaxAI/MiniMax-M3 pruned to the 87-expert REAP32 expert set and quantized with the RFI composite scheme:
- Expert set = REAP32 (87/128). The exact experts kept per layer match JANGQ-AI/MiniMax-M3-REAP32-Coder. We recovered that inventory without any calibration or forward passes: the MoE router-gate rows are per-expert fingerprints preserved from the base model, so matching our rows against REAP32's (exact, cosine ≈ 1.0) identifies precisely which experts to drop. This build was produced by reaping our 100-expert REAP22 model down to those 87 survivors (13 dropped per MoE layer), renumbering survivors and slicing the router gate + selection bias in lockstep.
- RFI composite quantization (details below) — routed experts 2-bit, most else 6/8-bit, with the router / embeddings / lm_head / norms / lightning indexer kept FP16.
- Runtime + EAGLE3 speculative decoding via the
tcclaviger/vllm22:devfork.
Sources & credits
| Role | Source |
|---|---|
| Original base model (derivative of) | MiniMaxAI/MiniMax-M3 |
| REAP32 expert inventory (which 87 experts) | JANGQ-AI/MiniMax-M3-REAP32-Coder |
| REAP22 lineage + mixed-quant inspiration | JANGQ-AI/MiniMax-M3-REAP22-Coder |
| Concept inspiration (M3 Coder build) | JANG2_L (vMLX) |
| REAP pruning method | Cerebras — REAP (ICLR 2026, arXiv:2510.13999) |
EAGLE3 drafter (bundled, eagle3/) |
Inferact/MiniMax-M3-EAGLE3 (TorchSpec) |
| Router-gate cross-reference, RFI quant, reap + build | tcclaviger |
All rights and the license belong to MiniMaxAI; this is a derivative work and inherits the MiniMax-M3 license.
Architecture (post-prune)
| Field | Value |
|---|---|
| Backbone | MiniMax-M3 sparse MoE (DSA "lightning indexer" attention) |
| Hidden size | 6144 |
| Layers | 60 (3 dense, 57 MoE) |
| Attention heads / KV heads | 64 / 4 (head dim 128, partial RoPE 0.5, θ=5e6) |
| Routed experts (post-REAP32) | 87, top-4 |
| Shared experts | 1 |
| Expert intermediate size | 3072 |
| Sparse indexer | index dim 128, 4 index heads, top-k 16 blocks × 128 |
| Vocab | 200,064 |
| Max context | 1,048,576 (1M) |
| Multimodal | vision tower present in weights, not yet functional (rotary bug) — disabled |
Parameters
Sparse MoE — total ≫ active (only top-4 of 87 routed experts + the shared expert fire per token):
| Component | Params |
|---|---|
| Routed experts (57 MoE layers × 87) | 280.8 B (~96%) |
| Attention (60 layers, incl. sparse index heads) | 6.6 B |
| Shared experts (57 layers) | 3.2 B |
Embeddings + lm_head |
2.5 B |
| Dense MLP (layers 0–2) + router + norms | 0.7 B |
| Total | ≈ 293.8 B |
| Active / token | ≈ 25.9 B |
REAP-pruned from MiniMax-M3's 128 experts to 87 (32% pruned; 42 B less
routed-expert weight than the 100-expert REAP22 build). Active params are
unchanged (26 B — routing still selects 4 experts). At 2.53 bpw the RFI quant
stores this in **84 GB** on disk (plus the 5.2 GB bundled EAGLE3 draft).
Quantization schema — RFI composite v1.0
RFI is a rotation-based composite integer scheme (quant_method: rfi),
Hadamard-32 rotation before quant, per-group group_size 32, symmetric,
block-float int8-mantissa/int8-exponent scales.
| Component | Bits |
|---|---|
Routed MoE experts (block_sparse_moe.experts) |
2-bit (codebook {-10,-3,3,10}) |
| Shared expert · vision tower · projector · patch-merge | 6-bit |
| Attention q/k/v/o · dense MLP | 8-bit |
Router gates · embeddings · lm_head · norms · lightning-indexer index_q/k_proj |
FP16 (unquantized) |
The lightning-indexer projections stay FP16 and un-fused from the QKV GEMM —
folding them into the RFI-packed QKV zeroes them and collapses the DSA block
selection to a recency window (loses context past top_k × block_size). This
fork keeps them as separate FP16 projections.
Running it
Requires the custom fork image (stock vLLM lacks the RFI kernels + M3 sparse-attention path):
- Docker image:
tcclaviger/vllm22— tag:dev.
docker pull tcclaviger/vllm22:dev
Serves on 4× AMD AI Pro R9700 (gfx1201 / RDNA4, 32 GB each; ROCm 7.2.1) at tensor-parallel 4:
docker run --rm \
--device /dev/kfd --device /dev/dri --group-add video \
--cap-add SYS_PTRACE --security-opt seccomp=unconfined \
--ipc host --network host --shm-size 32g \
-e HIP_VISIBLE_DEVICES=0,1,2,3 \
-v /path/to/MiniMax-M3-Coder-REAP32:/app/models \
tcclaviger/vllm22:dev \
/app/models \
--served-model-name M3-Coder-Lite \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.94 \
--max-model-len 100000 \
--kv-cache-dtype fp8 \
--block-size 128 \
--max-num-seqs 50 \
--max-num-batched-tokens 2048 \
--enable-chunked-prefill --enable-prefix-caching \
--chat-template /app/models/chat_template.jinja \
--reasoning-parser minimax_m3 \
--tool-call-parser minimax_m3 --enable-auto-tool-choice \
--default-chat-template-kwargs '{"enable_thinking": true}' \
--speculative-config '{"method": "eagle3", "model": "/app/models/eagle3", "num_speculative_tokens": 4}' \
--host 0.0.0.0 --port 8078
--quantization auto-detects from the checkpoint (quant_method: rfi).
Speculative decoding — bundled EAGLE3 draft (eagle3/)
This repo bundles an RFI6-quantized EAGLE3 draft under eagle3/
(the Inferact drafter,
quantized to 6-bit RFI: attention + MLP + fusion fc, with embeddings/lm_head/
norms BF16, 5.2 GB). Enable it with --speculative-config as shown above; it
shares this model's embedding + LM head at serve time and is lossless (the
target verifies every drafted token — output is byte-identical, only faster).
Measured on this model (4× R9700, num_speculative_tokens=4) — mean accepted
length is tokens emitted per target forward pass (the speedup proxy):
| Workload | Draft accept rate | Mean accepted length | Per-position accept (pos 1–4) |
|---|---|---|---|
| Math / reasoning (low temp) | ~76% | ~4.0 tokens/step | 0.91 / 0.80 / 0.71 / 0.61 |
| Coding (low temp) | ~53% | ~3.1 tokens/step | 0.78 / 0.58 / 0.44 / 0.33 |
| Agentic chat (sampled) | ~36% | ~2.4 tokens/step | 0.64 / 0.40 / 0.25 / 0.14 |
Net ~2.4–4× fewer target forward passes → >2× decode throughput in practice. The draft was trained against the full 128-expert MXFP8 M3, so acceptance on this REAP-pruned RFI target runs below the drafter's published numbers but is still a clear win.
Server-side features (tooling & thinking)
Handled at the server/model level — no client changes required:
- Custom Python tool-call parser (
--tool-call-parser minimax_m3), running on vLLM's Python parsing path — the Rust (vllm-rs) parser is not used. - Qwen-style thinking toggle:
enable_thinkingboolean via--default-chat-template-kwargs(or per-request), mapped to the model's thinking mode by the chat template + reasoning parser. - Guarded thinking tags: both
<think>…</think>and<mm:think>…</mm:think>are accepted, normalized, and resolved server-side (incl. replayed history).
Disclaimer
Experimental, in-development research artifact provided as-is, no warranty. A derivative of MiniMax-M3, subject to the upstream MiniMax-M3 license and usage terms. Evaluate quality and safety yourself before any use.
- Downloads last month
- 68
Model tree for tcclaviger/Minimax-M3-Coder-REAP32-RFI_2
Base model
MiniMaxAI/MiniMax-M3