Qwen3 30B-A3B (MoE)
Qwen3 30B-A3B (MoE) needs roughly 19.8 GB VRAM at Q4_K_M quantization (68.1 GB at FP16). 85 GPUs we track can run it fully in VRAM at 8k context.
85 GPUs run this natively · 19 with CPU offload
Qwen3 30B-A3B (MoE) is a Mixture of Experts (MoE) model developed by Alibaba, with 30B total parameters but only 3B active per token.
To run Qwen3 30B-A3B (MoE) locally: Q4_K_M needs about 19.8 GB at 8k context, so it fits on 24 GB GPUs. As a MoE model, its inference speed depends on the active parameters (3B) rather than the total size, but all 30B parameters must still fit in memory.
The MoE architecture delivers 30B-class quality at roughly 3B inference cost, so tokens/sec are high whenever the model fits in VRAM.
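The link between active parameters and speed can be illustrated with a common rule of thumb: during token generation, throughput is bounded by how fast the weights used for each token can be read from memory. The sketch below is only that rule of thumb, not the method behind the figures on this page. The function name, the 1000 GB/s bandwidth, and the 4.5 bits per weight (roughly what 16.9 GB for 30B parameters implies for Q4_K_M) are illustrative assumptions.

```python
def decode_tokens_per_sec_ceiling(active_params_b, bits_per_weight, bandwidth_gb_s):
    """Rough upper bound on decode speed for a memory-bandwidth-bound model.

    active_params_b: parameters read per token, in billions
    bits_per_weight: average bits stored per parameter (assumed, e.g. 4.5)
    bandwidth_gb_s:  memory bandwidth in GB/s (assumed, hardware dependent)
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical 1000 GB/s GPU, ~4.5 bits per weight.
moe_ceiling = decode_tokens_per_sec_ceiling(3, 4.5, 1000)     # 3B active per token
dense_ceiling = decode_tokens_per_sec_ceiling(30, 4.5, 1000)  # dense 30B, for contrast

print(f"MoE ceiling:   {moe_ceiling:.0f} t/s")
print(f"Dense ceiling: {dense_ceiling:.0f} t/s")
print(f"Ratio:         {moe_ceiling / dense_ceiling:.1f}x")
```

Real throughput is lower than this ceiling because of KV cache reads, expert routing, and kernel overhead, but the ratio shows why a 3B-active model can decode about ten times faster than a dense 30B model on the same hardware.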
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 120.0 GB | 0.81 GB | 135.3 GB |
| BF16 | 60.0 GB | 0.81 GB | 68.1 GB |
| FP16 | 60.0 GB | 0.81 GB | 68.1 GB |
| Q8_0 | 30.0 GB | 0.81 GB | 34.5 GB |
| Q6_K | 24.6 GB | 0.81 GB | 28.4 GB |
| Q5_K_M | 19.3 GB | 0.81 GB | 22.5 GB |
| Q4_K_M (recommended) | 16.9 GB | 0.81 GB | 19.8 GB |
| Q3_K_M | 12.9 GB | 0.81 GB | 15.3 GB |
| Q2_K | 9.9 GB | 0.81 GB | 12.0 GB |
| NVFP4 (CUDA only) | 15.0 GB | 0.81 GB | 17.7 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU.
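The totals in the table are larger than weights plus KV cache, which suggests a runtime overhead is included. The sketch below reproduces the table to within about 0.1 GB under three assumptions that are not stated on this page: an overhead factor of about 1.12 (inferred from the rows above), "8k" meaning 8192 tokens, and weights computed as 30B parameters times bits per weight. All names are hypothetical.

```python
PARAMS_B = 30.0        # total parameters, in billions (from this page)
KV_GB_AT_8K = 0.81     # FP16 KV cache at 8k context (from the table)
OVERHEAD = 1.12        # assumption: inferred from the table, not documented


def weights_gb(bits_per_weight):
    """Weight size in GB for a given average bits per parameter."""
    return PARAMS_B * bits_per_weight / 8


def kv_cache_gb(context_tokens):
    """KV cache size, scaled linearly from the 8k figure (8k assumed = 8192)."""
    return KV_GB_AT_8K * context_tokens / 8192


def total_vram_gb(weights, context_tokens=8192):
    """Estimated total VRAM: (weights + KV cache) times the assumed overhead."""
    return (weights + kv_cache_gb(context_tokens)) * OVERHEAD


print(f"FP16 weights:      {weights_gb(16):.1f} GB")
print(f"Q4_K_M at 8k:      {total_vram_gb(16.9):.1f} GB")
print(f"Q4_K_M at 32k:     {total_vram_gb(16.9, 32768):.1f} GB")
print(f"KV cache at 32k:   {kv_cache_gb(32768):.2f} GB")
```

Because the KV cache is small for this model at 8k, quadrupling the context adds only a few GB under these assumptions; the weights dominate the total.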
Benchmarks
GPUs that run Qwen3 30B-A3B (MoE) natively (85)
- NVIDIA RTX 5090 · NVFP4 · 1314.1 t/s
- NVIDIA RTX 5080 · Q2_K · 1069.9 t/s
- NVIDIA RTX 5070 Ti · Q2_K · 998.6 t/s
- NVIDIA RTX 5060 Ti 16GB · Q2_K · 499.3 t/s
- NVIDIA RTX 4090 · NVFP4 · 739.2 t/s
- NVIDIA RTX 4080 · Q2_K · 799.1 t/s
- NVIDIA RTX 4060 Ti 16GB · Q2_K · 321 t/s
- NVIDIA RTX 3090 · NVFP4 · 686.4 t/s
- NVIDIA RTX 3090 Ti · NVFP4 · 739.2 t/s
- NVIDIA H100 80GB · BF16 · 614.2 t/s
- NVIDIA A100 80GB · BF16 · 373.8 t/s
- NVIDIA A100 40GB · NVFP4 · 1140.3 t/s
- NVIDIA L40S · NVFP4 · 633.6 t/s
- NVIDIA RTX A6000 · NVFP4 · 563.2 t/s
- NVIDIA RTX 4000 Ada · NVFP4 · 234.7 t/s
- NVIDIA RTX 4500 Ada · NVFP4 · 316.8 t/s
- NVIDIA RTX 5000 Ada · NVFP4 · 422.4 t/s
- NVIDIA RTX 6000 Ada · NVFP4 · 704 t/s
- NVIDIA RTX Pro 6000 · BF16 · 246.4 t/s
- NVIDIA DGX Spark (128GB) · BF16 · 50.1 t/s
- AMD Radeon RX 7900 XTX · Q5_K_M · 546.6 t/s
- AMD Radeon RX 7900 XT · Q3_K_M · 682.2 t/s
- AMD Radeon RX 7900 GRE · Q2_K · 641.9 t/s
- AMD Radeon RX 6800 XT · Q2_K · 570.6 t/s
- AMD Radeon PRO W7800 · Q6_K · 257.6 t/s
- AMD Radeon PRO W7900 · Q8_0 · 316.8 t/s
- AMD Instinct MI300X · FP32 · 485.8 t/s
- AMD Radeon AI Pro 9700 32GB · Q6_K · 286.2 t/s
- AMD Strix Halo (128GB) · BF16 · 46.9 t/s
- AMD Strix Halo (96GB) · BF16 · 46.9 t/s
- AMD Strix Halo (64GB) · Q8_0 · 93.9 t/s
- Apple M5 Max (128GB) · BF16 · 112.6 t/s
- Apple M5 Max (64GB) · Q8_0 · 225.1 t/s
- Apple M5 Max (48GB) · Q8_0 · 225.1 t/s
- Apple M5 Pro (48GB) · Q8_0 · 112.6 t/s
- Apple M5 Pro (36GB) · Q6_K · 137.3 t/s
- Apple M5 Pro (24GB) · Q4_K_M · 199.9 t/s
- Apple M5 (32GB) · Q5_K_M · 87.1 t/s
- Apple M5 (16GB) · Q2_K · 170.5 t/s
- Apple M4 Ultra (384GB) · FP32 · 100.1 t/s
- Apple M4 Ultra (192GB) · FP32 · 100.1 t/s
- Apple M4 Max (128GB) · BF16 · 100.1 t/s
- Apple M4 Max (96GB) · BF16 · 100.1 t/s
- Apple M4 Max (64GB) · Q8_0 · 200.2 t/s
- Apple M4 Max (48GB) · Q8_0 · 200.2 t/s
- Apple M4 Pro (48GB) · Q8_0 · 100.1 t/s
- Apple M4 Pro (24GB) · Q4_K_M · 177.8 t/s
- Apple M4 (32GB) · Q5_K_M · 68.3 t/s
- Apple M4 (16GB) · Q2_K · 133.7 t/s
- Apple M3 Ultra (512GB) · FP32 · 75.1 t/s
- Apple M3 Ultra (256GB) · FP32 · 75.1 t/s
- Apple M3 Ultra (96GB) · BF16 · 150.2 t/s
- Apple M3 Max (128GB) · BF16 · 73.3 t/s
- Apple M3 Max (96GB) · BF16 · 73.3 t/s
- Apple M3 Max (64GB) · Q8_0 · 146.7 t/s
- Apple M3 Max (48GB) · Q8_0 · 146.7 t/s
- Apple M3 Max (36GB) · Q6_K · 178.9 t/s
- Apple M3 Pro (36GB) · Q6_K · 67.1 t/s
- Apple M3 Pro (18GB) · Q2_K · 167.2 t/s
- Apple M3 (24GB) · Q4_K_M · 65.1 t/s
- Apple M3 (16GB) · Q2_K · 111.4 t/s
- Apple M2 Ultra (384GB) · FP32 · 73.3 t/s
- Apple M2 Ultra (192GB) · FP32 · 73.3 t/s
- Apple M2 Max (96GB) · BF16 · 73.3 t/s
- Apple M2 Max (64GB) · Q8_0 · 146.7 t/s
- Apple M2 Max (32GB) · Q5_K_M · 227.7 t/s
- Apple M2 Pro (32GB) · Q5_K_M · 113.9 t/s
- Apple M2 Pro (16GB) · Q2_K · 222.9 t/s
- Apple M2 (24GB) · Q4_K_M · 65.1 t/s
- Apple M2 (16GB) · Q2_K · 111.4 t/s
- Apple M1 Ultra (128GB) · BF16 · 146.7 t/s
- Apple M1 Ultra (64GB) · Q8_0 · 293.3 t/s
- Apple M1 Max (64GB) · Q8_0 · 146.7 t/s
- Apple M1 Max (32GB) · Q5_K_M · 227.7 t/s
- Apple M1 Pro (32GB) · Q5_K_M · 113.9 t/s
- Apple M1 Pro (16GB) · Q2_K · 222.9 t/s
- Apple M1 (16GB) · Q2_K · 75.8 t/s
- Intel Arc Pro B70 24GB · Q5_K_M · 259.6 t/s
- Intel Arc Pro B60 24GB · Q5_K_M · 216.4 t/s
- Intel Arc A770 16GB · Q2_K · 624.1 t/s
- Intel Data Center GPU Max 1550 · BF16 · 600.6 t/s
- Intel Data Center GPU Max 1100 · Q8_0 · 450.6 t/s
- Intel Arc 140V (32GB) · Q5_K_M · 78 t/s
- Intel Arc 140V (16GB) · Q2_K · 152.7 t/s
- Intel Arc 130V (16GB) · Q2_K · 152.7 t/s
Plus 19 GPUs that run it with CPU offload (slower)
- NVIDIA RTX 5070 · NVFP4 · 112 t/s
- NVIDIA RTX 5060 · NVFP4 · 74.7 t/s
- NVIDIA RTX 5050 · NVFP4 · 53.3 t/s
- NVIDIA RTX 4070 Ti · NVFP4 · 84 t/s
- NVIDIA RTX 4070 · NVFP4 · 84 t/s
- NVIDIA RTX 4060 · NVFP4 · 45.3 t/s
- NVIDIA RTX 3080 10GB · NVFP4 · 126.7 t/s
- NVIDIA RTX 3060 12GB · NVFP4 · 60 t/s
- Intel Arc B580 12GB · Q8_0 · 38 t/s
- Intel Arc B570 10GB · Q8_0 · 31.7 t/s
- Intel Arc A770 8GB · Q6_K · 52 t/s
- Intel Arc A750 8GB · Q6_K · 52 t/s
- Intel Arc A580 8GB · Q6_K · 52 t/s
- Intel Arc A380 6GB · Q6_K · 18.9 t/s
- Intel Arc A310 4GB · Q6_K · 12.6 t/s
- Intel Arc Pro A60 12GB · Q8_0 · 32 t/s
- Intel Arc Pro A50 6GB · Q6_K · 19.5 t/s
- Intel Arc Pro A40 6GB · Q6_K · 19.5 t/s
- CPU only (system RAM) · Q5_K_M · 5.4 t/s
Notes
30B total, only 3B active per token — fast inference when it fits.
Frequently asked questions
- What are the VRAM requirements for Qwen3 30B-A3B (MoE)?
- Qwen3 30B-A3B (MoE) requires approximately 19.8 GB of VRAM at Q4_K_M quantization, 34.5 GB at Q8_0, and 68.1 GB at FP16. These numbers assume an 8k context window; the KV cache portion of the total grows linearly with context length.
- How many parameters does Qwen3 30B-A3B (MoE) have?
- Qwen3 30B-A3B (MoE) has 30 billion total parameters, but only 3 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests.
- How capable is Qwen3 30B-A3B (MoE)?
- With an MMLU-Pro score of 61.49, Qwen3 30B-A3B (MoE) delivers solid general-purpose performance suitable for most everyday tasks and professional use.
- Can Qwen3 30B-A3B (MoE) run on a 16 GB GPU?
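- Not at the recommended Q4_K_M, which needs 19.8 GB of VRAM. Lower-precision quantizations do fit: Q2_K needs 12.0 GB at 8k context, and the 16 GB cards listed above run the model natively at that setting, at some cost in output quality. For Q4_K_M you will need a 24 GB GPU like the RTX 4090 or RTX 3090.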
- Can Qwen3 30B-A3B (MoE) run on a 24 GB GPU?
- Yes. Qwen3 30B-A3B (MoE) fits in a 24 GB GPU at Q4_K_M, requiring 19.8 GB VRAM. GPUs with 24 GB include the RTX 4090, RTX 3090, and RTX 3090 Ti.
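- What is the highest-quality quantization of Qwen3 30B-A3B (MoE) that fits in 24 GB of VRAM?
- By the table above, Q5_K_M at 22.5 GB is the largest quantization that fits in 24 GB at 8k context. Q4_K_M (19.8 GB) and NVFP4 (17.7 GB, CUDA only) also fit and leave more headroom for longer contexts.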
- What GPU do I need to run Qwen3 30B-A3B (MoE) locally?
- For the recommended Q4_K_M quantization, which needs 19.8 GB of VRAM, a 24 GB GPU is the practical minimum. Good options: RTX 4090 (24 GB), RTX 3090 (24 GB). GPUs with 16 GB can run Q2_K (12.0 GB) at lower quality.
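The "does it fit" questions above all reduce to a lookup against the Total column of the VRAM table. The sketch below is one way to automate it, using total size as a proxy for quality; that proxy and the function name are assumptions, and it is not necessarily the rule used to pick the quantization shown next to each GPU on this page.

```python
# Total VRAM at 8k context, copied from the table on this page (GB).
TOTAL_VRAM_GB = {
    "FP32": 135.3,
    "BF16": 68.1,
    "FP16": 68.1,
    "Q8_0": 34.5,
    "Q6_K": 28.4,
    "Q5_K_M": 22.5,
    "Q4_K_M": 19.8,
    "NVFP4": 17.7,   # CUDA GPUs only, per the table note
    "Q3_K_M": 15.3,
    "Q2_K": 12.0,
}


def largest_quant_that_fits(vram_gb, headroom_gb=0.0):
    """Return the largest quantization whose total fits in vram_gb, or None.

    headroom_gb reserves extra memory (hypothetical knob, e.g. for a
    longer context or for the desktop compositor).
    """
    fitting = {
        name: total
        for name, total in TOTAL_VRAM_GB.items()
        if total + headroom_gb <= vram_gb
    }
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for vram in (8, 16, 24, 48):
    print(f"{vram} GB -> {largest_quant_that_fits(vram)}")
```

Passing a nonzero `headroom_gb` makes the choice more conservative; with 3 GB reserved on a 24 GB card, for example, the lookup falls back from Q5_K_M to Q4_K_M.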