Qwen3.5-REAP-262B-A17B-W4A16

W4A16 (INT4 weights, BF16 activations) quantized version of OpenMOSE/Qwen3.5-REAP-262B-A17B, created using AutoRound.

Model Details

| Property | Value |
|---|---|
| Base Model | OpenMOSE/Qwen3.5-REAP-262B-A17B |
| Architecture | Qwen3_5MoeForConditionalGeneration (Mixture-of-Experts + ViT) |
| Total Parameters | ~262B |
| Active Parameters | ~17B per token |
| Experts | 333 total, 10 per token |
| Layers | 60 (45 linear_attention + 15 full_attention) |
| Quantization | W4A16 (INT4 weights, BF16 activations) |
| Group Size | 128 |
| Method | AutoRound |
| Format | auto_round:auto_gptq (safetensors) |
| Required dtype | bfloat16 (float16 causes NaN due to GDN linear attention overflow) |

Quantization Details

  • Calibration dataset: NeelNanda/pile-10k (64 samples, seqlen=512)
  • Calibration mode: Text-only (vision encoder excluded from calibration)
  • Vision encoder: Preserved as clean BF16 (27-layer ViT, 1152 hidden, ~6.1GB)
  • shared_expert_gate layers: Kept at FP16 (60 router gate tensors)
  • MoE gate (router): Kept at BF16 (60 router weight tensors)
  • Runtime: ~4h12m on 8x NVIDIA H200 141GB
  • Peak VRAM: ~85GB across 8 GPUs
  • Peak RAM: ~804GB
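
For reference, below is a minimal sketch of the kind of AutoRound recipe that matches the settings above (bits=4, group_size=128, 64 pile-10k samples at seqlen 512). The exact script used for this checkpoint is not reproduced here, so treat the loading class, argument names, and values as assumptions rather than the actual command.

# Hedged sketch only -- approximate AutoRound W4A16 recipe, not the script used for this model.
# The real run also excluded the vision encoder from calibration and kept router gates unquantized.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "OpenMOSE/Qwen3.5-REAP-262B-A17B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                        # INT4 weights
    group_size=128,                # group size reported on this card
    dataset="NeelNanda/pile-10k",  # calibration dataset
    nsamples=64,                   # 64 calibration samples
    seqlen=512,                    # calibration sequence length
)
autoround.quantize()
autoround.save_quantized("Qwen3.5-REAP-262B-A17B-W4A16", format="auto_gptq")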

Important: dtype Must Be bfloat16

This model uses hybrid attention with GDN (Gated Delta Network) linear attention layers. These layers produce intermediate values that exceed float16's dynamic range (max 65504), causing NaN outputs. You must use --dtype bfloat16 (or torch_dtype=torch.bfloat16). The model will produce garbage output (!!!!!!!) with float16.
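
A minimal illustration of the dynamic-range issue (illustrative only, not model code): bfloat16 can represent values above 65504, but casting them to float16 produces inf, which then propagates into NaN downstream.

import torch

x = torch.tensor([70000.0], dtype=torch.bfloat16)  # representable in bfloat16
print(x.to(torch.float16))                          # inf: exceeds float16's max of 65504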

Usage with vLLM (Recommended)

Tested with vLLM v0.17.0rc1 (vllm/vllm-openai:cu130-nightly).

Requirements

  • A GPU with at least 24 GB VRAM (the rest of the model is offloaded to CPU RAM)
  • pip install conch-triton-kernels (required for vision encoder on Blackwell/Hopper GPUs)
  • A runtime patch for vLLM to handle per-expert weight format (see below)

Docker Compose (Single GPU + CPU Offload)

services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-qwen35-reap
    shm_size: 16g
    ipc: host
    volumes:
      - ${HOME}/.cache/huggingface:/root/.cache/huggingface
      - ./patch_vllm.py:/patch_vllm.py
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
      - VLLM_USE_MODELSCOPE=false
      - PYTHONUNBUFFERED=1
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        pip install conch-triton-kernels && \
        python3 /patch_vllm.py && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
          --dtype bfloat16 \
          --gpu-memory-utilization 0.70 \
          --cpu-offload-gb 70 \
          --max-model-len 2048 \
          --max-num-batched-tokens 2048 \
          --trust-remote-code \
          --enforce-eager \
          --port 8000 \
          --served-model-name Qwen3.5-REAP-262B-A17B-W4A16
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 360
      start_period: 900s

Docker Compose (2x RTX Pro 6000)

  • If $HF_HUB_CACHE is not set, change the volume mount to the default used in the single-GPU example above
  • You can try --enforce-eager
  • --gpu-memory-utilization may be increased
  • Throughput was ~800 tok/s aggregate under concurrent requests and up to ~120 tok/s for a single request
  • Vision input works when using patch_vllm_260320.py

services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-qwen35-reap
    shm_size: 16g
    ipc: host
    volumes:
      - ${HF_HUB_CACHE}:/root/.cache/huggingface
      - ./patch_vllm_260320.py:/patch_vllm.py
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
      - HF_HUB_CACHE=/root/.cache/huggingface
      - VLLM_USE_MODELSCOPE=false
      - PYTHONUNBUFFERED=1

    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        pip install conch-triton-kernels && \
        python3 /patch_vllm.py && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
          --dtype bfloat16 \
          --tensor-parallel-size 2 \
          --gpu-memory-utilization 0.90 \
          --max-num-batched-tokens 16384 \
          --trust-remote-code \
          --reasoning-parser qwen3 \
          --mm-encoder-tp-mode data \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser qwen3_coder \
          --port 8000 \
          --served-model-name Qwen3.5-REAP-262B-A17B-W4A16
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 360
      start_period: 900s

Required Runtime Patch (patch_vllm.py or patch_vllm_260320.py)

(patch_vllm_260320.py is confirmed working on 2x RTX Pro 6000; the original patch_vllm.py produces errors when an image is actually included in a message.)

vLLM requires a runtime patch to handle this model's per-expert weight format. The patch does three things:

  1. Expert weight name transform: The model stores expert weights as individual per-expert tensors (experts.gate_up_proj.{id}.qweight). The patch splits fused gate_up_proj into separate gate_proj/up_proj tensors and remaps names for vLLM's non-fused expert loader (sketched below).

  2. CPU offload assertion relaxed: The model uses hybrid attention (linear_attention + full_attention), which triggers a vLLM assertion blocking CPU weight offloading. The assertion is safely bypassed.

  3. Config patching (if not already fixed in repo): Adds block_name_to_quantize and the MoE gate entry to extra_config if missing.

See patch_vllm.py in this repository for the full patch.
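
For orientation, the name transform in step 1 amounts to something like the sketch below. This is illustrative only: the exact tensor naming handled by the patch and the axis along which the fused tensor is split are assumptions, so refer to patch_vllm.py for the real logic.

# Hedged sketch of step 1: split a fused per-expert gate_up_proj tensor into
# gate_proj / up_proj halves and remap its name for a non-fused expert loader.
import re
import torch

def remap_expert_weight(name: str, tensor: torch.Tensor):
    m = re.match(r"(.*\.experts)\.gate_up_proj\.(\d+)\.(\w+)$", name)
    if m is None:
        return [(name, tensor)]          # not a fused per-expert weight; pass through unchanged
    prefix, expert_id, suffix = m.groups()
    half = tensor.shape[-1] // 2         # assumed fused layout: [..., gate | up]
    return [
        (f"{prefix}.{expert_id}.gate_proj.{suffix}", tensor[..., :half]),
        (f"{prefix}.{expert_id}.up_proj.{suffix}", tensor[..., half:]),
    ]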

Key vLLM Parameters

| Parameter | Recommendation | Why |
|---|---|---|
| --dtype bfloat16 | Required | GDN linear attention overflows float16 |
| --enforce-eager | Recommended | Avoids torch.compile issues with hybrid attention |
| --cpu-offload-gb 70 | For single GPU | Model is ~93 GiB; offloads the majority to CPU RAM |
| --gpu-memory-utilization 0.70 | For single GPU | Leaves headroom for KV cache and activations |
| --max-model-len 2048 | Adjustable | Reduces KV cache memory; increase if you have more VRAM |
| --max-num-batched-tokens 2048 | Adjustable | Limits FLA kernel memory during Triton autotuning |

Test the Server

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-REAP-262B-A17B-W4A16",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'
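
Equivalently, from Python with the OpenAI client (a small example against the server started above; assumes the openai package is installed):

from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3.5-REAP-262B-A17B-W4A16",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=512,
    temperature=0.7,
)
print(resp.choices[0].message.content)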

Multi-GPU (No CPU Offload)

With 2+ GPUs totaling >100 GB VRAM, you can skip CPU offloading:

python3 -m vllm.entrypoints.openai.api_server \
  --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --enforce-eager \
  --served-model-name Qwen3.5-REAP-262B-A17B-W4A16

Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "atbender/Qwen3.5-REAP-262B-A17B-W4A16"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Post-Quantization Recovery

The following recovery steps were applied after quantization to ensure multimodal integrity:

  1. Config restoration: config.json restored from the source model (vision_config, text_config, and the correct architectures entry), with quantization_config injected
  2. Vision encoder recovery: Clean BF16 visual encoder (333 tensors) copied from source model as visual-encoder-clean.safetensors
  3. Tensor key remapping: Keys remapped from model.* to model.language_model.* prefix for compatibility with Qwen3_5MoeForConditionalGeneration (see the sketch after this list)
  4. Auxiliary files: preprocessor_config.json, video_preprocessor_config.json, generation_config.json copied from source
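
As an illustration of step 3, key remapping can be done with safetensors along these lines. This is a sketch only: file names are placeholders, the actual recovery script is not reproduced here, and a ~130 GB checkpoint would in practice need a sharded or streaming approach rather than a single load_file call.

# Hedged sketch: remap text/MoE tensor keys from "model.*" to
# "model.language_model.*" to match Qwen3_5MoeForConditionalGeneration.
from safetensors.torch import load_file, save_file

state = load_file("model.safetensors")  # placeholder path; the real file is ~130 GB
remapped = {}
for key, tensor in state.items():
    if key.startswith("model.") and not key.startswith("model.language_model."):
        key = key.replace("model.", "model.language_model.", 1)
    remapped[key] = tensor
save_file(remapped, "model.remapped.safetensors")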

File Structure

config.json                        # Multimodal config + quantization_config
model.safetensors                  # W4A16 quantized text/MoE (~130GB, 121,728 tensors)
model.safetensors.index.json       # 122,061 total tensor mappings
visual-encoder-clean.safetensors   # Clean BF16 vision encoder (333 tensors, ~6.1GB)
patch_vllm.py                      # vLLM runtime patch for per-expert weights
quantization_config.json
tokenizer.json
tokenizer_config.json
generation_config.json
preprocessor_config.json
video_preprocessor_config.json
chat_template.jinja

Known Issues and Errata (2026-03-06)

Issues found and fixed in this repository

| Issue | Root Cause | Fix Applied |
|---|---|---|
| TokenizersBackend not found | tokenizer_config.json used the TokenizersBackend class, which requires transformers 5.x | Changed to Qwen2TokenizerFast |
| Vision encoder KeyError during loading | Missing block_name_to_quantize caused INC to create GPTQ params for unquantized vision layers | Added block_name_to_quantize: "model.language_model.layers" |
| shared_expert_gate FP16 overrides ignored | extra_config keys used model.layers.X instead of model.language_model.layers.X | Fixed key prefixes |
| MoE router weights not loaded (NaN output) | MoE gate (mlp.gate) stored as BF16 but not marked as unquantized in extra_config | Added wildcard model.language_model.layers.*.mlp.gate entry |
| NaN / all "!" output with float16 | GDN linear attention intermediate values exceed the fp16 range (max 65504) | Set torch_dtype: "bfloat16" |

Issues requiring vLLM runtime patch

| Issue | Root Cause | Workaround |
|---|---|---|
| Per-expert weight format not supported | vLLM expects fused [num_experts, ...] tensors but the model stores individual per-expert tensors | Runtime patch splits gate_up_proj and remaps names |
| CPU offload + hybrid attention assertion | vLLM blocks CPU offloading when the model has multiple KV cache group sizes | Runtime patch bypasses the assertion |
| Vision encoder kernel failure on Blackwell | Dimension 4304 is incompatible with WNA16 kernels | pip install conch-triton-kernels |

Acknowledgments

  • Base model by OpenMOSE
  • Quantization method: AutoRound by Intel
  • Hardware: 8x NVIDIA H200 141GB GPUs