CanItRun Logocanitrun.

Qwen3 30B-A3B (MoE)

Qwen3 30B-A3B (MoE) needs roughly 19.8 GB VRAM at Q4_K_M quantization (68.1 GB at FP16). 85 GPUs we track can run it fully in VRAM at 8k context.

85 GPUs run this natively · 19 with CPU offload

Alibaba30B params3B active (MoE)128k contextApache 2.0Commercial use ok

Qwen3 30B-A3B (MoE) is a Mixture of Experts (MoE) model with 30B total parameters but only 3B active per token developed by Alibaba. Ultra-efficient MoE with 30B total parameters but only 3B active per token.

To run Qwen3 30B-A3B (MoE) locally: Q4_K_M needs ~18-20GB — runs on 24GB GPUs with excellent speed due to low active parameter count. As a MoE model, inference speed depends on active parameters (3B) rather than total size.

MoE architecture delivers 30B-class quality at 3B inference cost — exceptional tokens/sec when it fits.

VRAM at each quantization

Assumes 8k context. KV cache grows linearly with context length.

QuantWeightsKV cacheTotal
FP32120.0 GB0.81 GB135.3 GB
BF1660.0 GB0.81 GB68.1 GB
FP1660.0 GB0.81 GB68.1 GB
Q8_030.0 GB0.81 GB34.5 GB
Q6_K24.6 GB0.81 GB28.4 GB
Q5_K_M19.3 GB0.81 GB22.5 GB
Q4_K_Mrec16.9 GB0.81 GB19.8 GB
Q3_K_M12.9 GB0.81 GB15.3 GB
Q2_K9.9 GB0.81 GB12.0 GB
NVFP4cuda15.0 GB0.81 GB17.7 GB

KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.

Benchmarks

GPUs that run Qwen3 30B-A3B (MoE) natively (85)

Plus 19 GPUs that run it with CPU offload (slower)

Notes

30B total, only 3B active per token — fast inference when it fits.

Hugging Face ↗Ollama ↗Released 2025-04-29

Compare Qwen3 30B-A3B (MoE) with other models

Frequently asked questions

What are the VRAM requirements for Qwen3 30B-A3B (MoE)?
Qwen3 30B-A3B (MoE) requires approximately 19.8 GB of VRAM at Q4_K_M quantization, 34.5 GB at Q8, and 68.1 GB at FP16. These numbers assume 8k context window; VRAM scales linearly with context length due to the KV cache.
How many parameters does Qwen3 30B-A3B (MoE) have?
Qwen3 30B-A3B (MoE) has 30 billion total parameters, but only 3 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests.
How capable is Qwen3 30B-A3B (MoE)?
With an MMLU-Pro score of 61.49, Qwen3 30B-A3B (MoE) delivers solid general-purpose performance suitable for most everyday tasks and professional use.
Can Qwen3 30B-A3B (MoE) run on a 16 GB GPU?
No. At Q4_K_M, Qwen3 30B-A3B (MoE) needs 19.8 GB of VRAM — more than 16 GB. You will need a 24 GB GPU like the RTX 4090 or RTX 3090.
Can Qwen3 30B-A3B (MoE) run on a 24 GB GPU?
Yes. Qwen3 30B-A3B (MoE) fits in a 24 GB GPU at Q4_K_M, requiring 19.8 GB VRAM. GPUs with 24 GB include the RTX 4090, RTX 3090, and RTX 3090 Ti.
What is the smallest quantization for Qwen3 30B-A3B (MoE) that fits in 24 GB of VRAM?
At NVFP4, Qwen3 30B-A3B (MoE) needs 17.7 GB — the highest-quality quantization that fits in 24 GB of VRAM.
What GPU do I need to run Qwen3 30B-A3B (MoE) locally?
A 24 GB GPU is the minimum. At Q4_K_M, Qwen3 30B-A3B (MoE) needs 19.8 GB VRAM. Good options: RTX 4090 (24 GB), RTX 3090 (24 GB).