⚠️ Vision is not yet working — text-only for now.

Image/video input is not functional in this build. The MiniMax-M3 vision tower has an unresolved rotary-embedding bug, so multimodal input is disabled in the serving config. Text generation works; treat this as a text-only coder model until vision is fixed.

MiniMax-M3-Coder-REAP32 (RFI)

A more-aggressively-pruned REAP32 build of MiniMax-M3 — 87 of 128 routed experts kept (32% pruned) — RFI-composite-quantized and packaged to serve on AMD ROCm / RDNA4 (gfx1201) through a custom vLLM fork. Smaller and faster than the 100-expert REAP22 build, with a bundled EAGLE3 drafter for speculative decoding.

⚠️ Experimental — in active development

Functional but experimental: it runs and produces coherent long-context output, but it has not been broadly evaluated, the quant/prune recipe is still being tuned, and behavior may change between revisions. Not for production.


What this is

A derivative of MiniMaxAI/MiniMax-M3 pruned to the 87-expert REAP32 expert set and quantized with the RFI composite scheme:

  1. Expert set = REAP32 (87/128). The exact experts kept per layer match JANGQ-AI/MiniMax-M3-REAP32-Coder. We recovered that inventory without any calibration or forward passes: the MoE router-gate rows are per-expert fingerprints preserved from the base model, so matching our rows against REAP32's (exact, cosine ≈ 1.0) identifies precisely which experts to drop. This build was produced by reaping our 100-expert REAP22 model down to those 87 survivors (13 dropped per MoE layer), renumbering survivors and slicing the router gate + selection bias in lockstep.
  2. RFI composite quantization (details below) — routed experts 2-bit, most else 6/8-bit, with the router / embeddings / lm_head / norms / lightning indexer kept FP16.
  3. Runtime + EAGLE3 speculative decoding via the tcclaviger/vllm22:dev fork.
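
The router-gate matching in step 1 can be sketched as follows. This is a minimal, dependency-free illustration rather than the build script itself: the function names, the 0.999 threshold, and the toy 5-expert layer are assumptions made for the example, and a real run would operate on the per-layer gate tensors with NumPy or PyTorch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_survivors(source_gate, reference_gate, threshold=0.999):
    """For each reference router row, find the source expert it came from.

    Returns source expert indices in reference order. The threshold is a
    hypothetical guard: an exact copy of a row scores ~1.0.
    """
    kept = []
    for ref_row in reference_gate:
        sims = [cosine(ref_row, src_row) for src_row in source_gate]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < threshold:
            raise ValueError("no matching expert for a reference router row")
        kept.append(best)
    return kept

def slice_router(gate, bias, kept):
    """Slice gate rows and selection bias with the same index list (lockstep)."""
    return [gate[i] for i in kept], [bias[i] for i in kept]

# Toy layer: 5 source experts, hidden size 3; the reference build keeps 3.
source_gate = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
source_bias = [0.1, 0.2, 0.3, 0.4, 0.5]
reference_gate = [source_gate[0], source_gate[3], source_gate[4]]

kept = match_survivors(source_gate, reference_gate)
dropped = sorted(set(range(len(source_gate))) - set(kept))
new_gate, new_bias = slice_router(source_gate, source_bias, kept)
```

Here `kept` lists the surviving expert indices and `dropped` the experts to remove; the same `kept` list would also be used to select and renumber the expert weight tensors.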

Sources & credits

| Role | Source |
|---|---|
| Original base model (derivative of) | MiniMaxAI/MiniMax-M3 |
| REAP32 expert inventory (which 87 experts) | JANGQ-AI/MiniMax-M3-REAP32-Coder |
| REAP22 lineage + mixed-quant inspiration | JANGQ-AI/MiniMax-M3-REAP22-Coder |
| Concept inspiration (M3 Coder build) | JANG2_L (vMLX) |
| REAP pruning method | Cerebras: REAP (ICLR 2026, arXiv:2510.13999) |
| EAGLE3 drafter (bundled, eagle3/) | Inferact/MiniMax-M3-EAGLE3 (TorchSpec) |
| Router-gate cross-reference, RFI quant, reap + build | tcclaviger |

All rights and the license belong to MiniMaxAI; this is a derivative work and inherits the MiniMax-M3 license.


Architecture (post-prune)

| Field | Value |
|---|---|
| Backbone | MiniMax-M3 sparse MoE (DSA "lightning indexer" attention) |
| Hidden size | 6144 |
| Layers | 60 (3 dense, 57 MoE) |
| Attention heads / KV heads | 64 / 4 (head dim 128, partial RoPE 0.5, θ=5e6) |
| Routed experts (post-REAP32) | 87, top-4 |
| Shared experts | 1 |
| Expert intermediate size | 3072 |
| Sparse indexer | index dim 128, 4 index heads, top-k 16 blocks × 128 |
| Vocab | 200,064 |
| Max context | 1,048,576 (1M) |
| Multimodal | vision tower present in weights, not yet functional (rotary bug); disabled |

Parameters

Sparse MoE — total ≫ active (only top-4 of 87 routed experts + the shared expert fire per token):

| Component | Params |
|---|---|
| Routed experts (57 MoE layers × 87) | 280.8 B (~96%) |
| Attention (60 layers, incl. sparse index heads) | 6.6 B |
| Shared experts (57 layers) | 3.2 B |
| Embeddings + lm_head | 2.5 B |
| Dense MLP (layers 0–2) + router + norms | 0.7 B |
| Total | ≈ 293.8 B |
| Active / token | ≈ 25.9 B |

REAP-pruned from MiniMax-M3's 128 experts to 87 (32% pruned; 42 B less routed-expert weight than the 100-expert REAP22 build). Active params are unchanged (26 B — routing still selects 4 experts). At 2.53 bpw the RFI quant stores this in **84 GB** on disk (plus the 5.2 GB bundled EAGLE3 draft).
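
The parameter figures above can be cross-checked with a few lines of arithmetic. The sketch below assumes each expert is a gated MLP with three weight matrices (gate, up, down) and that the embeddings and lm_head are untied; neither is stated explicitly in this card, but both assumptions reproduce the table. The attention and dense/router/norm totals are taken from the table rather than derived.

```python
HIDDEN = 6144
EXPERT_INTERMEDIATE = 3072
MOE_LAYERS = 57
ROUTED_EXPERTS = 87
TOP_K = 4
VOCAB = 200_064

# Assumption: gated MLP expert = gate + up + down projections.
per_expert = 3 * HIDDEN * EXPERT_INTERMEDIATE

routed = MOE_LAYERS * ROUTED_EXPERTS * per_expert
shared = MOE_LAYERS * per_expert          # one shared expert per MoE layer
embeddings = 2 * VOCAB * HIDDEN           # assumption: untied embed + lm_head

# Taken directly from the table (not derived here).
attention = 6.6e9
dense_router_norms = 0.7e9

always_on = attention + shared + embeddings + dense_router_norms
total = routed + always_on
active = MOE_LAYERS * TOP_K * per_expert + always_on

# Routed-expert weight removed relative to the 100-expert REAP22 build.
saved_vs_reap22 = MOE_LAYERS * (100 - ROUTED_EXPERTS) * per_expert

for label, value in [
    ("routed", routed),
    ("shared", shared),
    ("total", total),
    ("active/token", active),
    ("saved vs REAP22", saved_vs_reap22),
]:
    print(f"{label:>16}: {value / 1e9:.1f} B")
```

Active parameters depend on `TOP_K`, not on `ROUTED_EXPERTS`, which is why pruning from 100 to 87 experts leaves the per-token figure unchanged.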


Quantization schema — RFI composite v1.0

RFI is a rotation-based composite integer scheme (quant_method: rfi). A Hadamard-32 rotation is applied before quantization, and weights are quantized symmetrically per group (group_size 32) with block-float scales (int8 mantissa, int8 exponent).

| Component | Bits |
|---|---|
| Routed MoE experts (block_sparse_moe.experts) | 2-bit (codebook {-10,-3,3,10}) |
| Shared expert · vision tower · projector · patch-merge | 6-bit |
| Attention q/k/v/o · dense MLP | 8-bit |
| Router gates · embeddings · lm_head · norms · lightning-indexer index_q/k_proj | FP16 (unquantized) |
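
To make the 2-bit path concrete, here is a rough sketch of rotate-then-quantize for one group of 32 weights. It is not the RFI kernel: the Sylvester Hadamard construction, the peak-based scale rule, and the function names are assumptions for illustration, and the block-float scale encoding is omitted (a plain float scale is used instead). Only the codebook and group size come from this card.

```python
import math

CODEBOOK = (-10, -3, 3, 10)  # 2-bit levels from the table above
GROUP = 32                   # group_size and Hadamard size

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

H = hadamard(GROUP)
NORM = 1.0 / math.sqrt(GROUP)

def rotate(vec):
    """Orthonormal Hadamard rotation. H is symmetric, so this is its own inverse."""
    return [NORM * sum(h * v for h, v in zip(row, vec)) for row in H]

def quantize_group(weights):
    """Rotate one group, then snap each value to the nearest codebook level."""
    rotated = rotate(weights)
    peak = max(abs(v) for v in rotated)
    # Assumed scale rule: map the largest rotated magnitude onto level 10.
    scale = peak / max(CODEBOOK) if peak > 0 else 1.0
    codes = [
        min(range(len(CODEBOOK)), key=lambda i: abs(v / scale - CODEBOOK[i]))
        for v in rotated
    ]
    return codes, scale

def dequantize_group(codes, scale):
    """Look up codebook levels, rescale, and undo the rotation."""
    return rotate([scale * CODEBOOK[c] for c in codes])

weights = [math.sin(i) for i in range(GROUP)]
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each entry of `codes` is an index 0 to 3, so a group of 32 weights packs into 64 bits plus its scale. The rotation spreads outliers across the group before the coarse 4-level rounding is applied.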

The lightning-indexer projections stay FP16 and un-fused from the QKV GEMM — folding them into the RFI-packed QKV zeroes them and collapses the DSA block selection to a recency window (loses context past top_k × block_size). This fork keeps them as separate FP16 projections.
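
The failure mode described above can be illustrated with a toy block selector. This is only a sketch: the real indexer scoring is not described in this card, and the tie-break toward recent blocks is an assumption chosen to mirror the observed "recency window" behaviour. The top-k and block size are the values from the architecture table.

```python
TOP_K_BLOCKS = 16
BLOCK_SIZE = 128

def select_blocks(block_scores, top_k=TOP_K_BLOCKS):
    """Return indices of the top_k highest-scoring blocks, oldest first.

    Ties are broken toward the more recent block (assumed tie-break).
    """
    order = sorted(
        range(len(block_scores)),
        key=lambda i: (block_scores[i], i),
        reverse=True,
    )
    return sorted(order[:top_k])

num_blocks = 100_000 // BLOCK_SIZE  # full blocks at --max-model-len 100000

# Working indexer: an early block scores highly and is selected.
healthy_scores = [0.0] * num_blocks
healthy_scores[3] = 5.0
healthy_pick = select_blocks(healthy_scores)

# Zeroed index projections: every score is 0, so only the tie-break decides.
zeroed_pick = select_blocks([0.0] * num_blocks)

window_tokens = TOP_K_BLOCKS * BLOCK_SIZE
```

With all scores at zero, `zeroed_pick` is simply the last 16 blocks, so attention sees only the most recent `window_tokens` (2048) tokens regardless of how long the context is.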


Running it

Requires the custom fork image (stock vLLM lacks the RFI kernels + M3 sparse-attention path):

docker pull tcclaviger/vllm22:dev

Serves on 4× AMD AI Pro R9700 (gfx1201 / RDNA4, 32 GB each; ROCm 7.2.1) at tensor-parallel 4:

docker run --rm \
  --device /dev/kfd --device /dev/dri --group-add video \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ipc host --network host --shm-size 32g \
  -e HIP_VISIBLE_DEVICES=0,1,2,3 \
  -v /path/to/MiniMax-M3-Coder-REAP32:/app/models \
  tcclaviger/vllm22:dev \
  /app/models \
    --served-model-name M3-Coder-Lite \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.94 \
    --max-model-len 100000 \
    --kv-cache-dtype fp8 \
    --block-size 128 \
    --max-num-seqs 50 \
    --max-num-batched-tokens 2048 \
    --enable-chunked-prefill --enable-prefix-caching \
    --chat-template /app/models/chat_template.jinja \
    --reasoning-parser minimax_m3 \
    --tool-call-parser minimax_m3 --enable-auto-tool-choice \
    --default-chat-template-kwargs '{"enable_thinking": true}' \
    --speculative-config '{"method": "eagle3", "model": "/app/models/eagle3", "num_speculative_tokens": 4}' \
    --host 0.0.0.0 --port 8078

--quantization does not need to be passed: the method is auto-detected from the checkpoint (quant_method: rfi).

Speculative decoding — bundled EAGLE3 draft (eagle3/)

This repo bundles an RFI6-quantized EAGLE3 draft under eagle3/ (the Inferact drafter, quantized to 6-bit RFI: attention + MLP + fusion fc, with embeddings/lm_head/norms in BF16; 5.2 GB). Enable it with --speculative-config as shown above. It shares this model's embedding + LM head at serve time and is lossless: the target verifies every drafted token, so output is byte-identical, only faster.

Measured on this model (4× R9700, num_speculative_tokens=4) — mean accepted length is tokens emitted per target forward pass (the speedup proxy):

| Workload | Draft accept rate | Mean accepted length | Per-position accept (pos 1–4) |
|---|---|---|---|
| Math / reasoning (low temp) | ~76% | ~4.0 tokens/step | 0.91 / 0.80 / 0.71 / 0.61 |
| Coding (low temp) | ~53% | ~3.1 tokens/step | 0.78 / 0.58 / 0.44 / 0.33 |
| Agentic chat (sampled) | ~36% | ~2.4 tokens/step | 0.64 / 0.40 / 0.25 / 0.14 |

Net: ~2.4–4× fewer target forward passes, which translates to >2× decode throughput in practice. The draft was trained against the full 128-expert MXFP8 M3, so acceptance on this REAP-pruned RFI target runs below the drafter's published numbers but is still a clear win.
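
The three columns of the table are related by simple arithmetic, which the sketch below reproduces. It assumes the per-position figures are the fraction of steps in which the draft token at that position is accepted; under that reading, each target pass emits one token of its own plus the accepted draft tokens. The function names and workload keys are illustrative.

```python
def mean_accepted_length(per_position_accept):
    """Expected tokens emitted per target forward pass.

    One token always comes from the target itself; each draft position
    adds its acceptance probability.
    """
    return 1.0 + sum(per_position_accept)

def draft_accept_rate(per_position_accept):
    """Fraction of drafted tokens that end up accepted."""
    return sum(per_position_accept) / len(per_position_accept)

workloads = {
    "math": [0.91, 0.80, 0.71, 0.61],
    "coding": [0.78, 0.58, 0.44, 0.33],
    "agentic": [0.64, 0.40, 0.25, 0.14],
}

for name, rates in workloads.items():
    length = mean_accepted_length(rates)
    accept = draft_accept_rate(rates)
    print(f"{name:>8}: {length:.2f} tokens/step, {accept:.0%} of drafts accepted")
```

The mean accepted length is the factor by which target forward passes are reduced. Wall-clock speedup is lower because each step also pays for the draft model and verification.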

Server-side features (tooling & thinking)

Handled at the server/model level — no client changes required:

  • Custom Python tool-call parser (--tool-call-parser minimax_m3), running on vLLM's Python parsing path — the Rust (vllm-rs) parser is not used.
  • Qwen-style thinking toggle: enable_thinking boolean via --default-chat-template-kwargs (or per-request), mapped to the model's thinking mode by the chat template + reasoning parser.
  • Guarded thinking tags: both <think>…</think> and <mm:think>…</mm:think> are accepted, normalized, and resolved server-side (incl. replayed history).

Disclaimer

Experimental, in-development research artifact provided as-is, no warranty. A derivative of MiniMax-M3, subject to the upstream MiniMax-M3 license and usage terms. Evaluate quality and safety yourself before any use.
