Bonsai

Prism ML Website  |  Whitepaper  |  Demo & Examples  |  Colab Notebook  |  Discord

Bonsai-4B-GGUF-1bit

End-to-end 1-bit language model for llama.cpp (CUDA, Metal, CPU)

14.2x smaller than FP16 | 4.2x faster on RTX 4090 | runs on any device

Highlights

  • 0.57 GB deployed footprint fits on virtually any device, GPU or CPU
  • End-to-end 1-bit weights across embeddings, attention projections, MLP projections, and LM head
  • GGUF Q1_0_g128 format: 1-bit weight packing with one shared FP16 scale per group of 128 weights
  • Cross-platform: CUDA (RTX/datacenter), Metal (Mac), Swift (iPhone/iPad), Android
  • MLX companion: also available as MLX 1-bit g128 for native Apple Silicon inference

Frontier Efficiency

Resources

  • Google Colab — try Bonsai in your browser, no setup required
  • Whitepaper — for more details on Bonsai, check out our whitepaper
  • Demo repo — comprehensive examples for serving, benchmarking, and integrating Bonsai
  • Discord — join the community for support, discussion, and updates
  • 1-bit kernels: llama.cpp fork (CUDA + Metal) · MLX fork (Apple Silicon) · mlx-swift fork (iOS/macOS)
  • Locally AI — we have partnered with Locally AI for iPhone support

Model Overview

| Item | Specification |
|---|---|
| Parameters | 4.0B (~3.6B non-embedding) |
| Architecture | Qwen3-4B dense: GQA (32 query / 8 KV heads), SwiGLU MLP, RoPE, RMSNorm |
| Layers | 36 Transformer decoder blocks |
| Context length | 32,768 tokens |
| Vocab size | 151,936 |
| Weight format | GGUF Q1_0_g128 |
| Deployed size | 0.57 GB (14.2x smaller than FP16) |
| 1-bit coverage | Embeddings, attention projections, MLP projections, LM head |
| License | Apache 2.0 |

Quantization Format: Q1_0_g128

Each weight is a single bit: 0 maps to −scale, 1 maps to +scale. Every group of 128 weights shares one FP16 scale factor.

Effective bits per weight: 1.125 (1 sign bit + 16-bit scale amortized over 128 weights).
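The pack/unpack logic can be sketched in a few lines. This is an illustrative sketch only, not the actual GGUF kernel code: the real Q1_0_g128 bit/byte layout and scale rule are defined by the fork's kernels, and the mean-absolute-value scale used here is an assumption.

```python
# Illustrative Q1_0_g128 sketch (NOT the actual GGUF layout):
# each group of 128 weights stores one scale plus 128 sign bits;
# bit 0 -> -scale, bit 1 -> +scale.

GROUP = 128

def quantize_group(weights):
    """Quantize one group of 128 floats to (scale, 16 packed bytes)."""
    assert len(weights) == GROUP
    # Scale choice here: mean absolute value of the group (an assumption;
    # the scale rule used by Bonsai is not specified in this card).
    scale = sum(abs(w) for w in weights) / GROUP
    packed = bytearray(GROUP // 8)
    for i, w in enumerate(weights):
        if w >= 0:
            packed[i // 8] |= 1 << (i % 8)
    return scale, bytes(packed)

def dequantize_group(scale, packed):
    """Expand 16 packed bytes back to 128 values of +/-scale."""
    out = []
    for i in range(GROUP):
        bit = (packed[i // 8] >> (i % 8)) & 1
        out.append(scale if bit else -scale)
    return out
```

Storage per group under this scheme is 16 bytes of signs plus one 16-bit scale, i.e. (128 + 16) / 128 = 1.125 bits per weight, matching the figure above.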

Memory Requirement

Parameter memory only (weights and scales loaded into memory):

| Format | Size | Reduction | Ratio |
|---|---|---|---|
| FP16 | 8.04 GB | — | 1.0x |
| GGUF Q1_0_g128 | 0.57 GB | 93.0% | 14.2x |
| MLX 1-bit g128 | 0.63 GB | 92.2% | 12.8x |

The GGUF file on disk is slightly larger than the 0.57 GB of parameter memory (~6.4 MB more) because the format embeds the tokenizer, chat template, and model metadata alongside the weights.
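The table's Q1_0_g128 row follows from simple arithmetic. A back-of-envelope check, assuming ~4.02B stored parameters (inferred from the 8.04 GB FP16 figure at 2 bytes per parameter):

```python
# Reproduce the memory table from first principles.
PARAMS = 8.04e9 / 2              # FP16 is 2 bytes/param -> ~4.02B stored params
BITS_PER_WEIGHT = 1 + 16 / 128   # 1 sign bit + fp16 scale amortized over 128

q1_gb = PARAMS * BITS_PER_WEIGHT / 8 / 1e9   # bits -> bytes -> GB
ratio = 8.04 / q1_gb

print(f"{q1_gb:.2f} GB, {ratio:.1f}x smaller")   # -> 0.57 GB, 14.2x smaller
```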

Best Practices

Generation Parameters

| Parameter | Default | Suggested range |
|---|---|---|
| Temperature | 0.5 | 0.5–0.7 |
| Top-k | 20 | 20–40 |
| Top-p | 0.9 | 0.85–0.95 |
| Repetition penalty | 1.0 | — |
| Presence penalty | 0.0 | — |

System Prompt

You can use a simple system prompt such as:

You are a helpful assistant
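For programmatic use, the system prompt and the sampling defaults above combine into a single request body. A minimal sketch, assuming the OpenAI-compatible chat schema that llama-server exposes (`top_k` is a llama.cpp extension to that schema):

```python
import json

def build_chat_request(user_prompt):
    """Assemble a chat request using the defaults recommended above."""
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.5,
        "top_p": 0.9,
        "top_k": 20,
        "max_tokens": 256,
    }

body = json.dumps(build_chat_request("Explain quantum computing in simple terms."))
```

POST this body to the server's `/v1/chat/completions` endpoint (see the Quickstart below for starting the server).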

Quickstart

llama.cpp (CUDA)

# Clone the PrismML fork of llama.cpp (includes Q1_0_g128 kernels)
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build with CUDA support
cmake -B build -DGGML_CUDA=ON && cmake --build build -j

# Run inference
./build/bin/llama-cli \
    -m Bonsai-4B-Q1_0_g128.gguf \
    -p "Explain quantum computing in simple terms." \
    -n 256 \
    --temp 0.5 \
    --top-p 0.85 \
    --top-k 20 \
    -ngl 99

llama.cpp (Metal / macOS)

# Clone the PrismML fork of llama.cpp (includes Q1_0_g128 kernels)
git clone https://github.com/PrismML-Eng/llama.cpp
cd llama.cpp

# Build with Metal support (default on macOS)
cmake -B build && cmake --build build -j

# Run inference
./build/bin/llama-cli \
    -m Bonsai-4B-Q1_0_g128.gguf \
    -p "Explain quantum computing in simple terms." \
    -n 256 \
    --temp 0.5 \
    --top-p 0.85 \
    --top-k 20 \
    -ngl 99

llama.cpp Server

./build/bin/llama-server \
    -m Bonsai-4B-Q1_0_g128.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99

Open the web UI at http://127.0.0.1:8080, or see our llama.cpp fork for more examples.

Cross-Platform Throughput

| Platform | Backend | TG128 (tok/s) | FP16 TG128 (tok/s) | TG speedup vs FP16 | PP512 (tok/s) | FP16 PP512 (tok/s) |
|---|---|---|---|---|---|---|
| RTX 4090 | llama.cpp CUDA | 440 | 105 | 4.2x | 18,135 | 16,310 |
| M4 Pro 48 GB | llama.cpp Metal | 136 | 29 | 4.7x | 915 | 915 |
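The speedup column is just the ratio of the measured token-generation rates (numbers taken directly from the table):

```python
# Sanity check: TG speedup = Q1_0_g128 TG128 rate / FP16 TG128 rate.
cuda_speedup = round(440 / 105, 1)    # RTX 4090
metal_speedup = round(136 / 29, 1)    # M4 Pro 48 GB
print(cuda_speedup, metal_speedup)    # -> 4.2 4.7
```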

Citation

If you use 1-bit Bonsai 4B, please cite:

@techreport{bonsai,
    title   = {Bonsai: End-to-End 1-bit Language Model Deployment
               Across Apple, GPU, and Mobile Runtimes},
    author  = {Prism ML},
    year    = {2026},
    month   = {March},
    url     = {https://prismml.com}
}

Contact

For questions, feedback, or collaboration inquiries: contact@prismml.com
