# Qwen3.5-REAP-262B-A17B-W4A16

W4A16 (INT4 weights, BF16 activations) quantized version of OpenMOSE/Qwen3.5-REAP-262B-A17B, created using AutoRound.
## Model Details
| Property | Value |
|---|---|
| Base Model | OpenMOSE/Qwen3.5-REAP-262B-A17B |
| Architecture | `Qwen3_5MoeForConditionalGeneration` (Mixture-of-Experts + ViT) |
| Total Parameters | ~262B |
| Active Parameters | ~17B per token |
| Experts | 333 total, 10 per token |
| Layers | 60 (45 linear_attention + 15 full_attention) |
| Quantization | W4A16 (INT4 weights, BF16 activations) |
| Group Size | 128 |
| Method | AutoRound |
| Format | `auto_round:auto_gptq` (safetensors) |
| Required dtype | bfloat16 (float16 causes NaN due to GDN linear-attention overflow) |
## Quantization Details
- Calibration dataset: `NeelNanda/pile-10k` (64 samples, seqlen=512)
- Calibration mode: text-only (vision encoder excluded from calibration)
- Vision encoder: preserved as clean BF16 (27-layer ViT, hidden size 1152, ~6.1GB)
- `shared_expert_gate` layers: kept at FP16 (60 gate tensors)
- MoE gate (router): kept at BF16 (60 router weight tensors)
- Runtime: ~4h12m on 8x NVIDIA H200 141GB
- Peak VRAM: ~85GB across 8 GPUs
- Peak RAM: ~804GB
## Important: dtype Must Be bfloat16

This model uses hybrid attention with GDN (Gated Delta Network) linear-attention layers. These layers produce intermediate values that exceed float16's dynamic range (max 65504), causing NaN outputs. You must use `--dtype bfloat16` (or `torch_dtype=torch.bfloat16`); with float16 the model produces garbage output (`!!!!!!!`).
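The overflow is easy to reproduce in isolation, without loading the model. A minimal sketch in plain PyTorch; the 70000 value is just an illustrative magnitude, not one measured from the model:

```python
import torch

# float16's largest finite value is 65504; GDN linear-attention intermediates
# in this model exceed it, so an fp16 cast overflows to inf, and the inf
# propagates into NaN downstream.
print(torch.finfo(torch.float16).max)  # 65504.0

x = torch.tensor([70000.0])            # an illustrative intermediate magnitude
print(x.to(torch.float16))             # tensor([inf], dtype=torch.float16)
print(x.to(torch.bfloat16))            # finite: bfloat16 keeps fp32's exponent range
```

bfloat16 trades mantissa precision for float32's 8-bit exponent, which is why it survives where float16 saturates.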
## Usage with vLLM (Recommended)

Tested with vLLM v0.17.0rc1 (`vllm/vllm-openai:cu130-nightly`).
### Requirements

- GPU with at least 24 GB VRAM (with CPU offloading for the rest)
- `pip install conch-triton-kernels` (required for the vision encoder on Blackwell/Hopper GPUs)
- A runtime patch for vLLM to handle the per-expert weight format (see below)
### Docker Compose (Single GPU + CPU Offload)

```yaml
services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-qwen35-reap
    shm_size: 16g
    ipc: host
    volumes:
      - ${HOME}/.cache/huggingface:/root/.cache/huggingface
      - ./patch_vllm.py:/patch_vllm.py
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
      - VLLM_USE_MODELSCOPE=false
      - PYTHONUNBUFFERED=1
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        pip install conch-triton-kernels && \
        python3 /patch_vllm.py && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
          --dtype bfloat16 \
          --gpu-memory-utilization 0.70 \
          --cpu-offload-gb 70 \
          --max-model-len 2048 \
          --max-num-batched-tokens 2048 \
          --trust-remote-code \
          --enforce-eager \
          --port 8000 \
          --served-model-name Qwen3.5-REAP-262B-A17B-W4A16
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 360
      start_period: 900s
```
### Docker Compose (2x RTX Pro 6000)

- If you don't have `$HF_HUB_CACHE` set, you can change the volume mount back to the defaults above
- You can try adding `--enforce-eager`; `--gpu-memory-utilization` may be increased
- Performance was ~800 tok/s generation with concurrent requests and up to ~120 tok/s for a single request
- Works with vision using `patch_vllm_260320.py`
```yaml
services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-qwen35-reap
    shm_size: 16g
    ipc: host
    volumes:
      - ${HF_HUB_CACHE}:/root/.cache/huggingface
      - ./patch_vllm_260320.py:/patch_vllm.py
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
      - HF_HUB_CACHE=/root/.cache/huggingface
      - VLLM_USE_MODELSCOPE=false
      - PYTHONUNBUFFERED=1
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        pip install conch-triton-kernels && \
        python3 /patch_vllm.py && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
          --dtype bfloat16 \
          --tensor-parallel-size 2 \
          --gpu-memory-utilization 0.90 \
          --max-num-batched-tokens 16384 \
          --trust-remote-code \
          --reasoning-parser qwen3 \
          --mm-encoder-tp-mode data \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser qwen3_coder \
          --port 8000 \
          --served-model-name Qwen3.5-REAP-262B-A17B-W4A16
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 360
      start_period: 900s
```
### Required Runtime Patch (patch_vllm.py or patch_vllm_260320.py)

(`patch_vllm_260320.py` is confirmed working on 2x RTX Pro 6000; the original `patch_vllm.py` produces errors when an image is actually included in a message.)

vLLM requires a runtime patch to handle this model's per-expert weight format. The patch does three things:

1. **Expert weight name transform**: The model stores expert weights as individual per-expert tensors (`experts.gate_up_proj.{id}.qweight`). The patch splits the fused `gate_up_proj` into separate `gate_proj`/`up_proj` tensors and remaps names for vLLM's non-fused expert loader.
2. **CPU offload assertion relaxed**: The model uses hybrid attention (linear_attention + full_attention), which triggers a vLLM assertion blocking CPU weight offloading. The assertion is safely bypassed.
3. **Config patching (if not already fixed in the repo)**: Adds `block_name_to_quantize` and the MoE gate entry to `extra_config` if missing.

See `patch_vllm.py` in this repository for the full patch.
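The name-transform step can be sketched as a pure string rewrite. This is a hypothetical illustration only: the input pattern follows the naming given above, but the output convention (`experts.{id}.gate_proj.*`) is an assumption, and the real patch must also split each fused tensor in half along the output dimension, which is omitted here.

```python
import re

# Hypothetical sketch of the expert-weight name transform. The checkpoint
# stores per-expert fused tensors like "...experts.gate_up_proj.3.qweight";
# vLLM's non-fused loader expects separate gate_proj / up_proj names.
def remap_expert_name(name: str) -> list[str]:
    m = re.fullmatch(r"(.*\.experts)\.gate_up_proj\.(\d+)\.(\w+)", name)
    if m is None:
        return [name]  # not a fused per-expert tensor; pass through unchanged
    prefix, expert_id, suffix = m.groups()
    return [
        f"{prefix}.{expert_id}.gate_proj.{suffix}",
        f"{prefix}.{expert_id}.up_proj.{suffix}",
    ]

print(remap_expert_name("model.layers.0.mlp.experts.gate_up_proj.3.qweight"))
# -> ['model.layers.0.mlp.experts.3.gate_proj.qweight',
#     'model.layers.0.mlp.experts.3.up_proj.qweight']
```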
### Key vLLM Parameters

| Parameter | Value | Why |
|---|---|---|
| `--dtype bfloat16` | Required | GDN linear attention overflows float16 |
| `--enforce-eager` | Recommended | Avoids torch.compile issues with hybrid attention |
| `--cpu-offload-gb 70` | For single GPU | Model is ~93 GiB; offloads the majority to CPU RAM |
| `--gpu-memory-utilization 0.70` | For single GPU | Leaves headroom for KV cache and activations |
| `--max-model-len 2048` | Adjustable | Reduces KV cache memory; increase if you have more VRAM |
| `--max-num-batched-tokens 2048` | Adjustable | Limits FLA kernel memory during Triton autotuning |
### Test the Server

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-REAP-262B-A17B-W4A16",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'
```
### Multi-GPU (No CPU Offload)

With 2+ GPUs totaling >100 GB VRAM, you can skip CPU offloading:

```bash
python3 -m vllm.entrypoints.openai.api_server \
  --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --enforce-eager \
  --served-model-name Qwen3.5-REAP-262B-A17B-W4A16
```
## Usage with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "atbender/Qwen3.5-REAP-262B-A17B-W4A16"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bfloat16 is required; float16 produces NaN
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Post-Quantization Recovery

The following recovery steps were applied after quantization to ensure multimodal integrity:

- **Config restoration**: `config.json` restored from the source model with `vision_config`, `text_config`, and correct `architectures`, with `quantization_config` injected
- **Vision encoder recovery**: Clean BF16 visual encoder (333 tensors) copied from the source model as `visual-encoder-clean.safetensors`
- **Tensor key remapping**: Keys remapped from `model.*` to the `model.language_model.*` prefix for compatibility with `Qwen3_5MoeForConditionalGeneration`
- **Auxiliary files**: `preprocessor_config.json`, `video_preprocessor_config.json`, `generation_config.json` copied from the source model
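The key-remapping step above amounts to a string transform over the safetensors key names. A hypothetical sketch of that rule (not the actual recovery script; the exemption for `model.visual.*` keys is an assumption about how vision keys are named):

```python
# Hypothetical sketch of the tensor-key remapping described above: text/MoE
# keys saved under "model.*" gain the "model.language_model." prefix so they
# match the Qwen3_5MoeForConditionalGeneration module tree; keys already
# prefixed (and, by assumption, vision keys) are left untouched.
def remap_key(key: str) -> str:
    if key.startswith("model.") and not key.startswith(
        ("model.language_model.", "model.visual.")
    ):
        return "model.language_model." + key[len("model."):]
    return key

print(remap_key("model.layers.0.mlp.gate.weight"))
# -> model.language_model.layers.0.mlp.gate.weight
```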
## File Structure

```
config.json                       # Multimodal config + quantization_config
model.safetensors                 # W4A16 quantized text/MoE (~130GB, 121,728 tensors)
model.safetensors.index.json      # 122,061 total tensor mappings
visual-encoder-clean.safetensors  # Clean BF16 vision encoder (333 tensors, ~6.1GB)
patch_vllm.py                     # vLLM runtime patch for per-expert weights
quantization_config.json
tokenizer.json
tokenizer_config.json
generation_config.json
preprocessor_config.json
video_preprocessor_config.json
chat_template.jinja
```
## Known Issues and Errata (2026-03-06)

### Issues found and fixed in this repository

| Issue | Root Cause | Fix Applied |
|---|---|---|
| `TokenizersBackend` not found | `tokenizer_config.json` used the `TokenizersBackend` class, which requires transformers 5.x | Changed to `Qwen2TokenizerFast` |
| Vision encoder `KeyError` during loading | Missing `block_name_to_quantize` caused INC to create GPTQ params for unquantized vision layers | Added `block_name_to_quantize: "model.language_model.layers"` |
| `shared_expert_gate` FP16 overrides ignored | `extra_config` keys used `model.layers.X` instead of `model.language_model.layers.X` | Fixed key prefixes |
| MoE router weights not loaded (NaN output) | MoE gate (`mlp.gate`) stored as BF16 but not marked as unquantized in `extra_config` | Added wildcard `model.language_model.layers.*.mlp.gate` entry |
| NaN / all-`!` output with float16 | GDN linear-attention intermediate values exceed fp16 range (max 65504) | Set `torch_dtype: "bfloat16"` |
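The config-side fixes above can be verified mechanically before loading the model. A hypothetical sketch; the key names come from this card, but `check_quant_config` and the `{"bits": 16}` override payload are illustrative stand-ins, not the actual AutoRound/INC schema:

```python
# Sketch of a mechanical check for the config-side fixes listed above.
def check_quant_config(qcfg: dict) -> list[str]:
    problems = []
    if qcfg.get("block_name_to_quantize") != "model.language_model.layers":
        problems.append("missing/wrong block_name_to_quantize")
    extra = qcfg.get("extra_config", {})
    if "model.language_model.layers.*.mlp.gate" not in extra:
        problems.append("MoE router (mlp.gate) not exempted in extra_config")
    if any(k.startswith("model.layers.") for k in extra):
        problems.append("extra_config keys missing language_model prefix")
    return problems

good = {
    "block_name_to_quantize": "model.language_model.layers",
    # The override payload shape is a hypothetical placeholder.
    "extra_config": {"model.language_model.layers.*.mlp.gate": {"bits": 16}},
}
print(check_quant_config(good))  # -> []
print(check_quant_config({}))    # -> two problems reported
```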
### Issues requiring the vLLM runtime patch

| Issue | Root Cause | Workaround |
|---|---|---|
| Per-expert weight format not supported | vLLM expects fused `[num_experts, ...]` tensors but the model stores individual per-expert tensors | Runtime patch splits `gate_up_proj` and remaps names |
| CPU offload + hybrid attention assertion | vLLM blocks CPU offloading when a model has multiple KV cache group sizes | Runtime patch bypasses the assertion |
| Vision encoder kernel failure on Blackwell | Dimension 4304 incompatible with WNA16 kernels | `pip install conch-triton-kernels` |
## Model Tree

Base model: Qwen/Qwen3.5-397B-A17B