Qwen3 235B-A22B (MoE)
Qwen3 235B-A22B (MoE) needs roughly 149.9 GB of VRAM at Q4_K_M quantization (528.2 GB at FP16). Of the GPUs we track, 20 can run it fully in VRAM at 8k context, and 2 more can run it with CPU offload.
Qwen3 235B-A22B (MoE) is a Mixture of Experts (MoE) model developed by Alibaba and released in April 2025 as a flagship model. It has 235B total parameters, but only 22B are active per token, and its architecture is optimized for reasoning.
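To run Qwen3 235B-A22B (MoE) locally, even Q2_K needs about 88.4 GB of VRAM at 8k context, so you need a multi-GPU server or a machine with large unified memory, such as a Mac Studio with an M2 or M3 Ultra. Because it is a MoE model, inference speed depends mainly on the active parameters (22B) rather than the total size, but all 235B parameters must still fit in memory.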
It pairs MoE efficiency with frontier-scale quality and is designed for complex reasoning and agentic workflows.
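A rough way to see why the active parameter count governs speed: during decoding, memory bandwidth is usually the bottleneck, and each token only has to read the roughly 22B active parameters rather than all 235B. The sketch below is a back-of-the-envelope upper bound under that assumption, not a prediction of real throughput. The 800 GB/s bandwidth is a hypothetical value, and the ~4.5 bits per weight for Q4_K_M is derived from the table's 132.3 GB weight figure (132.3 GB / 235B parameters). KV-cache reads, expert-routing overhead, and compute limits are ignored.

```python
# Back-of-the-envelope, bandwidth-bound decode speed for a MoE model.
# Assumption: generating one token reads each *active* weight once from memory;
# KV-cache reads, routing overhead and compute limits are ignored.


def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Approximate bytes of weights read from memory per generated token."""
    return active_params * bits_per_weight / 8


def max_tokens_per_second(bandwidth_gb_s: float, active_params: float,
                          bits_per_weight: float) -> float:
    """Upper bound on tokens/s when decoding is limited by memory bandwidth."""
    return bandwidth_gb_s * 1e9 / bytes_per_token(active_params, bits_per_weight)


if __name__ == "__main__":
    BANDWIDTH_GB_S = 800.0  # hypothetical device bandwidth, for illustration only
    BITS_Q4_K_M = 4.5       # ~132.3 GB / 235B params from the VRAM table
    dense = max_tokens_per_second(BANDWIDTH_GB_S, 235e9, BITS_Q4_K_M)
    moe = max_tokens_per_second(BANDWIDTH_GB_S, 22e9, BITS_Q4_K_M)
    print(f"if all 235B params were active: <= {dense:.1f} t/s")
    print(f"with 22B active params:         <= {moe:.1f} t/s")
```

With these inputs the bound is about 6.1 t/s if every parameter were active versus about 64.6 t/s with 22B active, a ratio of 235/22 (roughly 10.7×). The memory footprint, however, is still set by all 235B parameters.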
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 940.0 GB | 1.58 GB | 1054.6 GB |
| BF16 | 470.0 GB | 1.58 GB | 528.2 GB |
| FP16 | 470.0 GB | 1.58 GB | 528.2 GB |
| Q8_0 | 235.0 GB | 1.58 GB | 265.0 GB |
| Q6_K | 192.7 GB | 1.58 GB | 217.6 GB |
| Q5_K_M | 151.3 GB | 1.58 GB | 171.3 GB |
| Q4_K_M | 132.3 GB | 1.58 GB | 149.9 GB |
| Q3_K_M | 101.0 GB | 1.58 GB | 114.9 GB |
| Q2_K | 77.3 GB | 1.58 GB | 88.4 GB |
| NVFP4 | 117.5 GB | 1.58 GB | 133.4 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU.
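The table's columns can be reproduced with two simple formulas. The sketch below is a reconstruction, not the site's actual calculator. The architecture values (94 layers, 4 KV heads, head dimension 128) are assumed from the model's published configuration, 8k is treated as 8,192 tokens, and the 1.12 multiplier is an overhead factor inferred by dividing each Total by Weights + KV cache. Under those assumptions the FP16 KV cache at 8k context comes to 1.58 GB, and the totals match the table to within rounding.

```python
# Reconstruction of the VRAM table above (a sketch, not the site's calculator).
# Assumptions: 94 layers, 4 KV heads, head_dim 128 (from the model config),
# FP16 KV cache, and a 1.12x overhead factor inferred from the table's totals.

GB = 1e9  # the table uses decimal gigabytes


def kv_cache_gb(context_tokens: int, layers: int = 94, kv_heads: int = 4,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """K and V tensors for every layer, KV head and token: linear in context length."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * context_tokens / GB


def weights_gb(total_params: float, bits_per_weight: float) -> float:
    """All parameters must be resident, not just the 22B active per token."""
    return total_params * bits_per_weight / 8 / GB


def total_vram_gb(weights: float, kv: float, overhead: float = 1.12) -> float:
    """Weights plus KV cache, scaled by the inferred runtime-overhead factor."""
    return (weights + kv) * overhead


if __name__ == "__main__":
    kv_8k = kv_cache_gb(8192)
    print(f"KV cache @ 8k:  {kv_8k:.2f} GB")
    print(f"KV cache @ 32k: {kv_cache_gb(32768):.2f} GB")
    print(f"BF16 total:     {total_vram_gb(weights_gb(235e9, 16), kv_8k):.1f} GB")
    print(f"Q4_K_M total:   {total_vram_gb(132.3, kv_8k):.1f} GB")
```

Quadrupling the context to 32k quadruples the KV cache (about 6.31 GB) but leaves the weights unchanged, which is why long contexts matter far less for this model than the choice of quantization.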
Benchmarks
GPUs that run Qwen3 235B-A22B (MoE) natively (20)
- NVIDIA RTX Pro 6000: Q2_K · 204.3 t/s
- NVIDIA DGX Spark (128GB): Q3_K_M · 31.7 t/s
- AMD Instinct MI300X: Q5_K_M · 411.5 t/s
- AMD Strix Halo (128GB): Q3_K_M · 29.8 t/s
- AMD Strix Halo (96GB): Q2_K · 38.9 t/s
- Apple M5 Max (128GB): Q3_K_M · 71.4 t/s
- Apple M4 Ultra (384GB): Q8_0 · 54.6 t/s
- Apple M4 Ultra (192GB): Q5_K_M · 84.8 t/s
- Apple M4 Max (128GB): Q3_K_M · 63.5 t/s
- Apple M4 Max (96GB): Q2_K · 83 t/s
- Apple M3 Ultra (512GB): Q8_0 · 41 t/s
- Apple M3 Ultra (256GB): Q6_K · 49.9 t/s
- Apple M3 Ultra (96GB): Q2_K · 124.5 t/s
- Apple M3 Max (128GB): Q3_K_M · 46.5 t/s
- Apple M3 Max (96GB): Q2_K · 60.8 t/s
- Apple M2 Ultra (384GB): Q8_0 · 40 t/s
- Apple M2 Ultra (192GB): Q5_K_M · 62.1 t/s
- Apple M2 Max (96GB): Q2_K · 60.8 t/s
- Apple M1 Ultra (128GB): Q3_K_M · 93 t/s
- Intel Data Center GPU Max 1550: Q3_K_M · 380.9 t/s
Plus 2 GPUs that run it with CPU offload (slower)
- NVIDIA H100 80GB: Q2_K · 115.7 t/s
- NVIDIA A100 80GB: Q2_K · 70.4 t/s
Notes
Flagship reasoning MoE: 235B total, 22B active. Needs a multi-GPU server or a single machine with large unified memory (96 GB or more).
Frequently asked questions
- What are the VRAM requirements for Qwen3 235B-A22B (MoE)?
- Qwen3 235B-A22B (MoE) requires approximately 149.9 GB of VRAM at Q4_K_M quantization, 265.0 GB at Q8_0, and 528.2 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length, while the weights stay fixed.
- How many parameters does Qwen3 235B-A22B (MoE) have?
- Qwen3 235B-A22B (MoE) has 235 billion total parameters, but only 22 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests, although all 235 billion parameters must still be held in memory.
- How capable is Qwen3 235B-A22B (MoE)?
- Qwen3 235B-A22B (MoE) achieves an MMLU-Pro score of 84.4, placing it among the most capable open-weight models available — competitive with frontier systems on general knowledge and reasoning.
- Can Qwen3 235B-A22B (MoE) run on a 16 GB GPU?
- No. At Q4_K_M, Qwen3 235B-A22B (MoE) needs 149.9 GB of VRAM, and even Q2_K needs 88.4 GB, far more than 16 GB. You will need a multi-GPU server or a machine with large unified memory.
- Can Qwen3 235B-A22B (MoE) run on a 24 GB GPU?
- No. Even at Q2_K, Qwen3 235B-A22B (MoE) needs 88.4 GB, and Q4_K_M needs 149.9 GB. Consider a multi-GPU server with at least 88.4 GB of total VRAM for Q2_K, or 150 GB or more for Q4_K_M.
- What is the smallest quantization for Qwen3 235B-A22B (MoE) that fits in 24 GB of VRAM?
- Qwen3 235B-A22B (MoE) cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 88.4 GB at Q2_K.
- What GPU do I need to run Qwen3 235B-A22B (MoE) locally?
- You need either a multi-GPU server or a single machine with large unified memory. At Q4_K_M, Qwen3 235B-A22B (MoE) needs 149.9 GB of VRAM, more than any single consumer GPU offers. For a server, consider 2–4× H100 or A100 GPUs.
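The FAQ answers above boil down to a lookup against the table's totals. The helpers below (`best_quant` and `gpus_needed` are hypothetical, illustrative names, not part of any tool) pick the highest-precision quantization that fits a memory budget and a naive GPU count for a multi-GPU split. They use the 8k-context totals from the VRAM table and assume VRAM pools perfectly across GPUs, which real layer splits do not quite achieve, so treat the GPU count as a lower bound.

```python
import math
from typing import Optional

# 8k-context totals (GB) from the VRAM table; NVFP4 left out because it needs a CUDA GPU.
TOTAL_GB = {
    "Q8_0": 265.0,
    "Q6_K": 217.6,
    "Q5_K_M": 171.3,
    "Q4_K_M": 149.9,
    "Q3_K_M": 114.9,
    "Q2_K": 88.4,
}


def best_quant(budget_gb: float) -> Optional[str]:
    """Highest-precision quant whose total fits the budget, or None if none fits."""
    fitting = [(gb, quant) for quant, gb in TOTAL_GB.items() if gb <= budget_gb]
    return max(fitting)[1] if fitting else None


def gpus_needed(quant: str, vram_per_gpu_gb: float) -> int:
    """Naive count of identical GPUs whose pooled VRAM covers the total (a lower bound)."""
    return math.ceil(TOTAL_GB[quant] / vram_per_gpu_gb)


if __name__ == "__main__":
    for budget in (16, 24, 96, 128, 192):
        print(f"{budget:>4} GB -> {best_quant(budget)}")
    print("80 GB GPUs for Q4_K_M:", gpus_needed("Q4_K_M", 80))
```

For example, 16 GB and 24 GB budgets return `None`, a 128 GB budget lands on Q3_K_M (consistent with the 128 GB entries in the GPU list), and Q4_K_M needs at least two 80 GB GPUs, with little headroom left for per-GPU overhead.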