Qwen3.5-REAP-262B-A17B-W4A16

W4A16 (INT4 weights, BF16 activations) quantized version of OpenMOSE/Qwen3.5-REAP-262B-A17B, created using AutoRound.

Model Details

| Property | Value |
|---|---|
| Base Model | OpenMOSE/Qwen3.5-REAP-262B-A17B |
| Architecture | Qwen3_5MoeForConditionalGeneration (Mixture-of-Experts + ViT) |
| Total Parameters | ~262B |
| Active Parameters | ~17B per token |
| Experts | 333 total, 10 per token |
| Layers | 60 (45 linear_attention + 15 full_attention) |
| Quantization | W4A16 (INT4 weights, BF16 activations) |
| Group Size | 128 |
| Method | AutoRound |
| Format | auto_round:auto_gptq (safetensors) |
| Required dtype | bfloat16 (float16 causes NaN due to GDN linear attention overflow) |

Quantization Details

  • Calibration dataset: NeelNanda/pile-10k (64 samples, seqlen=512)
  • Calibration mode: Text-only (vision encoder excluded from calibration)
  • Vision encoder: Preserved as clean BF16 (27-layer ViT, 1152 hidden, ~6.1GB)
  • shared_expert_gate layers: Kept at FP16 (60 router gate tensors)
  • MoE gate (router): Kept at BF16 (60 router weight tensors)
  • Runtime: ~4h12m on 8x NVIDIA H200 141GB
  • Peak VRAM: ~85GB across 8 GPUs
  • Peak RAM: ~804GB
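
For reference, below is a minimal sketch of the kind of AutoRound recipe that matches the settings above (bits=4, group_size=128, 64 pile-10k samples at seqlen 512). The exact script used for this checkpoint is not reproduced here, so treat the loading class, argument names, and values as assumptions rather than the actual command.

# Hedged sketch only -- approximate AutoRound W4A16 recipe, not the script used for this model.
# The real run also excluded the vision encoder from calibration and kept router gates unquantized.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_id = "OpenMOSE/Qwen3.5-REAP-262B-A17B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,                        # INT4 weights
    group_size=128,                # group size reported on this card
    dataset="NeelNanda/pile-10k",  # calibration dataset
    nsamples=64,                   # 64 calibration samples
    seqlen=512,                    # calibration sequence length
)
autoround.quantize()
autoround.save_quantized("Qwen3.5-REAP-262B-A17B-W4A16", format="auto_gptq")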

Important: dtype Must Be bfloat16

This model uses hybrid attention with GDN (Gated Delta Network) linear attention layers. These layers produce intermediate values that exceed float16's dynamic range (max 65504), causing NaN outputs. You must use --dtype bfloat16 (or torch_dtype=torch.bfloat16). The model will produce garbage output (!!!!!!!) with float16.
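
A minimal illustration of the dynamic-range issue (illustrative only, not model code): bfloat16 can represent values above 65504, but casting them to float16 produces inf, which then propagates into NaN downstream.

import torch

x = torch.tensor([70000.0], dtype=torch.bfloat16)  # representable in bfloat16
print(x.to(torch.float16))                          # inf: exceeds float16's max of 65504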

Usage with vLLM (Recommended)

Tested with vLLM v0.17.0rc1 (vllm/vllm-openai:cu130-nightly).

Requirements

  • A GPU with at least 24 GB VRAM (the rest of the model is offloaded to CPU RAM)
  • pip install conch-triton-kernels (required for vision encoder on Blackwell/Hopper GPUs)
  • A runtime patch for vLLM to handle per-expert weight format (see below)

Docker Compose (Single GPU + CPU Offload)

services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-qwen35-reap
    shm_size: 16g
    ipc: host
    volumes:
      - ${HOME}/.cache/huggingface:/root/.cache/huggingface
      - ./patch_vllm.py:/patch_vllm.py
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
      - VLLM_USE_MODELSCOPE=false
      - PYTHONUNBUFFERED=1
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        pip install conch-triton-kernels && \
        python3 /patch_vllm.py && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
          --dtype bfloat16 \
          --gpu-memory-utilization 0.70 \
          --cpu-offload-gb 70 \
          --max-model-len 2048 \
          --max-num-batched-tokens 2048 \
          --trust-remote-code \
          --enforce-eager \
          --port 8000 \
          --served-model-name Qwen3.5-REAP-262B-A17B-W4A16
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 360
      start_period: 900s

Docker Compose (2x RTX Pro 6000)

  • If $HF_HUB_CACHE is not set, change the volume mount to the default used in the single-GPU example above
  • You can try --enforce-eager
  • --gpu-memory-utilization may be increased
  • Throughput was ~800 tok/s aggregate under concurrent requests and up to ~120 tok/s for a single request
  • Vision input works when using patch_vllm_260320.py

services:
  vllm:
    image: vllm/vllm-openai:cu130-nightly
    container_name: vllm-qwen35-reap
    shm_size: 16g
    ipc: host
    volumes:
      - ${HF_HUB_CACHE}:/root/.cache/huggingface
      - ./patch_vllm_260320.py:/patch_vllm.py
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
      - HF_HUB_CACHE=/root/.cache/huggingface
      - VLLM_USE_MODELSCOPE=false
      - PYTHONUNBUFFERED=1

    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0", "1"]
              capabilities: [gpu]
    entrypoint: ["/bin/bash", "-c"]
    command:
      - |
        pip install conch-triton-kernels && \
        python3 /patch_vllm.py && \
        python3 -m vllm.entrypoints.openai.api_server \
          --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
          --dtype bfloat16 \
          --tensor-parallel-size 2 \
          --gpu-memory-utilization 0.90 \
          --max-num-batched-tokens 16384 \
          --trust-remote-code \
          --reasoning-parser qwen3 \
          --mm-encoder-tp-mode data \
          --enable-prefix-caching \
          --enable-auto-tool-choice \
          --tool-call-parser qwen3_coder \
          --port 8000 \
          --served-model-name Qwen3.5-REAP-262B-A17B-W4A16
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 360
      start_period: 900s

Required Runtime Patch (patch_vllm.py or patch_vllm_260320.py)

(patch_vllm_260320.py is confirmed working on 2x RTX Pro 6000; the original patch_vllm.py produces errors when an image is actually included in a message.)

vLLM requires a runtime patch to handle this model's per-expert weight format. The patch does three things:

  1. Expert weight name transform: The model stores expert weights as individual per-expert tensors (experts.gate_up_proj.{id}.qweight). The patch splits fused gate_up_proj into separate gate_proj/up_proj tensors and remaps names for vLLM's non-fused expert loader (sketched below).

  2. CPU offload assertion relaxed: The model uses hybrid attention (linear_attention + full_attention), which triggers a vLLM assertion blocking CPU weight offloading. The assertion is safely bypassed.

  3. Config patching (if not already fixed in repo): Adds block_name_to_quantize and the MoE gate entry to extra_config if missing.

See patch_vllm.py in this repository for the full patch.
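
For orientation, the name transform in step 1 amounts to something like the sketch below. This is illustrative only: the exact tensor naming handled by the patch and the axis along which the fused tensor is split are assumptions, so refer to patch_vllm.py for the real logic.

# Hedged sketch of step 1: split a fused per-expert gate_up_proj tensor into
# gate_proj / up_proj halves and remap its name for a non-fused expert loader.
import re
import torch

def remap_expert_weight(name: str, tensor: torch.Tensor):
    m = re.match(r"(.*\.experts)\.gate_up_proj\.(\d+)\.(\w+)$", name)
    if m is None:
        return [(name, tensor)]          # not a fused per-expert weight; pass through unchanged
    prefix, expert_id, suffix = m.groups()
    half = tensor.shape[-1] // 2         # assumed fused layout: [..., gate | up]
    return [
        (f"{prefix}.{expert_id}.gate_proj.{suffix}", tensor[..., :half]),
        (f"{prefix}.{expert_id}.up_proj.{suffix}", tensor[..., half:]),
    ]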

Key vLLM Parameters

| Parameter | Recommendation | Why |
|---|---|---|
| --dtype bfloat16 | Required | GDN linear attention overflows float16 |
| --enforce-eager | Recommended | Avoids torch.compile issues with hybrid attention |
| --cpu-offload-gb 70 | For single GPU | Model is ~93 GiB; offloads the majority to CPU RAM |
| --gpu-memory-utilization 0.70 | For single GPU | Leaves headroom for KV cache and activations |
| --max-model-len 2048 | Adjustable | Reduces KV cache memory; increase if you have more VRAM |
| --max-num-batched-tokens 2048 | Adjustable | Limits FLA kernel memory during Triton autotuning |

Test the Server

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-REAP-262B-A17B-W4A16",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 512,
    "temperature": 0.7
  }'
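
Equivalently, from Python with the OpenAI client (a small example against the server started above; assumes the openai package is installed):

from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen3.5-REAP-262B-A17B-W4A16",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=512,
    temperature=0.7,
)
print(resp.choices[0].message.content)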

Multi-GPU (No CPU Offload)

With 2+ GPUs totaling >100 GB VRAM, you can skip CPU offloading:

python3 -m vllm.entrypoints.openai.api_server \
  --model atbender/Qwen3.5-REAP-262B-A17B-W4A16 \
  --dtype bfloat16 \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --enforce-eager \
  --served-model-name Qwen3.5-REAP-262B-A17B-W4A16

Usage with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "atbender/Qwen3.5-REAP-262B-A17B-W4A16"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
inputs = inputs.to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

Post-Quantization Recovery

The following recovery steps were applied after quantization to ensure multimodal integrity:

  1. Config restoration: config.json restored from the source model (vision_config, text_config, and the correct architectures entry), with quantization_config injected
  2. Vision encoder recovery: Clean BF16 visual encoder (333 tensors) copied from source model as visual-encoder-clean.safetensors
  3. Tensor key remapping: Keys remapped from model.* to model.language_model.* prefix for compatibility with Qwen3_5MoeForConditionalGeneration (see the sketch after this list)
  4. Auxiliary files: preprocessor_config.json, video_preprocessor_config.json, generation_config.json copied from source
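
As an illustration of step 3, key remapping can be done with safetensors along these lines. This is a sketch only: file names are placeholders, the actual recovery script is not reproduced here, and a ~130 GB checkpoint would in practice need a sharded or streaming approach rather than a single load_file call.

# Hedged sketch: remap text/MoE tensor keys from "model.*" to
# "model.language_model.*" to match Qwen3_5MoeForConditionalGeneration.
from safetensors.torch import load_file, save_file

state = load_file("model.safetensors")  # placeholder path; the real file is ~130 GB
remapped = {}
for key, tensor in state.items():
    if key.startswith("model.") and not key.startswith("model.language_model."):
        key = key.replace("model.", "model.language_model.", 1)
    remapped[key] = tensor
save_file(remapped, "model.remapped.safetensors")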

File Structure

config.json                        # Multimodal config + quantization_config
model.safetensors                  # W4A16 quantized text/MoE (~130GB, 121,728 tensors)
model.safetensors.index.json       # 122,061 total tensor mappings
visual-encoder-clean.safetensors   # Clean BF16 vision encoder (333 tensors, ~6.1GB)
patch_vllm.py                      # vLLM runtime patch for per-expert weights
quantization_config.json
tokenizer.json
tokenizer_config.json
generation_config.json
preprocessor_config.json
video_preprocessor_config.json
chat_template.jinja

Known Issues and Errata (2026-03-06)

Issues found and fixed in this repository

| Issue | Root Cause | Fix Applied |
|---|---|---|
| TokenizersBackend not found | tokenizer_config.json used the TokenizersBackend class, which requires transformers 5.x | Changed to Qwen2TokenizerFast |
| Vision encoder KeyError during loading | Missing block_name_to_quantize caused INC to create GPTQ params for unquantized vision layers | Added block_name_to_quantize: "model.language_model.layers" |
| shared_expert_gate FP16 overrides ignored | extra_config keys used model.layers.X instead of model.language_model.layers.X | Fixed key prefixes |
| MoE router weights not loaded (NaN output) | MoE gate (mlp.gate) stored as BF16 but not marked as unquantized in extra_config | Added wildcard model.language_model.layers.*.mlp.gate entry |
| NaN / all "!" output with float16 | GDN linear attention intermediate values exceed the fp16 range (max 65504) | Set torch_dtype: "bfloat16" |

Issues requiring vLLM runtime patch

| Issue | Root Cause | Workaround |
|---|---|---|
| Per-expert weight format not supported | vLLM expects fused [num_experts, ...] tensors but the model stores individual per-expert tensors | Runtime patch splits gate_up_proj and remaps names |
| CPU offload + hybrid attention assertion | vLLM blocks CPU offloading when the model has multiple KV cache group sizes | Runtime patch bypasses the assertion |
| Vision encoder kernel failure on Blackwell | Dimension 4304 is incompatible with WNA16 kernels | pip install conch-triton-kernels |

Acknowledgments

  • Base model by OpenMOSE
  • Quantization method: AutoRound by Intel
  • Hardware: 8x NVIDIA H200 141GB GPUs