Llama 4 Maverick 400B
Llama 4 Maverick 400B needs roughly 256.7 GB VRAM at Q4_K_M quantization (900.5 GB at FP16). 7 GPUs we track can run it fully in VRAM at 8k context.
7 GPUs run this natively · 0 with CPU offload
Llama 4 Maverick 400B is Meta's flagship Mixture of Experts (MoE) model: 400B total parameters spread across 128 experts, with only 17B active per token.
To run Llama 4 Maverick 400B locally, even Q2_K needs ~150-180 GB of VRAM, which is multi-GPU server territory and not practical for typical local deployment. As a MoE model, its inference speed depends on the active parameters (17B) rather than the total size, but all 400B parameters still have to fit in memory.
An MMLU-Pro score of 80.5% rivals closed-source frontier models.
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 1600.0 GB | 4.03 GB | 1796.5 GB |
| BF16 | 800.0 GB | 4.03 GB | 900.5 GB |
| FP16 | 800.0 GB | 4.03 GB | 900.5 GB |
| Q8_0 | 400.0 GB | 4.03 GB | 452.5 GB |
| Q6_K | 328.0 GB | 4.03 GB | 371.9 GB |
| Q5_K_M | 257.6 GB | 4.03 GB | 293.0 GB |
| Q4_K_M | 225.2 GB | 4.03 GB | 256.7 GB |
| Q3_K_M | 172.0 GB | 4.03 GB | 197.2 GB |
| Q2_K (recommended) | 131.6 GB | 4.03 GB | 151.9 GB |
| NVFP4 (CUDA only) | 200.0 GB | 4.03 GB | 228.5 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU.
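The table's arithmetic can be reproduced with a short sketch. The weight and KV cache figures are copied from the table. The 1.12 overhead factor is an assumption: it is not stated on this page, but it is the multiplier that makes (weights + KV cache) match every Total value to one decimal place. Reading "8k" as 8192 tokens and scaling the KV cache proportionally are also assumptions, based on the statement that the KV cache grows linearly with context length.

```python
# Weight sizes in GB, copied from the table above.
WEIGHTS_GB = {
    "FP32": 1600.0, "BF16": 800.0, "FP16": 800.0, "Q8_0": 400.0,
    "Q6_K": 328.0, "Q5_K_M": 257.6, "Q4_K_M": 225.2, "Q3_K_M": 172.0,
    "Q2_K": 131.6, "NVFP4": 200.0,
}

KV_CACHE_GB_AT_8K = 4.03   # FP16 KV cache at 8k context, from the table
BASE_CONTEXT = 8192        # assumption: "8k" means 8192 tokens
OVERHEAD_FACTOR = 1.12     # assumption: inferred from the table, not documented


def kv_cache_gb(context_tokens: int) -> float:
    """Scale the 8k KV cache figure linearly with context length."""
    return KV_CACHE_GB_AT_8K * context_tokens / BASE_CONTEXT


def total_vram_gb(quant: str, context_tokens: int = BASE_CONTEXT) -> float:
    """Estimate total VRAM as (weights + KV cache) times the inferred overhead."""
    return (WEIGHTS_GB[quant] + kv_cache_gb(context_tokens)) * OVERHEAD_FACTOR


if __name__ == "__main__":
    for quant in WEIGHTS_GB:
        at_8k = total_vram_gb(quant)
        at_32k = total_vram_gb(quant, 32768)
        print(f"{quant:8s} 8k: {at_8k:7.1f} GB   32k: {at_32k:7.1f} GB")
```

Under these assumptions, moving from 8k to 32k context adds roughly 13.5 GB to every row, which is small next to the weights themselves.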
GPUs that run Llama 4 Maverick 400B natively (7)
- AMD Instinct MI300X: Q2_K · 1042.4 t/s
- Apple M4 Ultra (384GB): Q6_K · 86.2 t/s
- Apple M4 Ultra (192GB): Q2_K · 214.8 t/s
- Apple M3 Ultra (512GB): Q8_0 · 53 t/s
- Apple M3 Ultra (256GB): Q3_K_M · 123.2 t/s
- Apple M2 Ultra (384GB): Q6_K · 63.1 t/s
- Apple M2 Ultra (192GB): Q2_K · 157.3 t/s
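The quantization listed next to each device is consistent with picking the largest quantization whose total fits in that device's memory. A minimal sketch of that selection follows, assuming the 8k-context totals from the VRAM table on this page and assuming the whole stated capacity is usable for the model. The function name and the `has_cuda` flag are hypothetical, introduced only for illustration.

```python
from typing import Optional

# Total VRAM in GB at 8k context, copied from the table on this page.
TOTAL_GB = {
    "FP32": 1796.5, "BF16": 900.5, "FP16": 900.5, "Q8_0": 452.5,
    "Q6_K": 371.9, "Q5_K_M": 293.0, "Q4_K_M": 256.7, "Q3_K_M": 197.2,
    "Q2_K": 151.9, "NVFP4": 228.5,
}

# The page notes that NVFP4 requires a CUDA GPU.
CUDA_ONLY = {"NVFP4"}


def largest_quant_that_fits(vram_gb: float, has_cuda: bool = False) -> Optional[str]:
    """Return the largest quantization that fits in vram_gb, or None."""
    candidates = [
        (total, name)
        for name, total in TOTAL_GB.items()
        if total <= vram_gb and (has_cuda or name not in CUDA_ONLY)
    ]
    if not candidates:
        return None
    # max() on (total, name) tuples picks the largest footprint that still fits.
    return max(candidates)[1]


if __name__ == "__main__":
    for capacity in (192, 256, 384, 512):
        print(capacity, "GB ->", largest_quant_that_fits(capacity))
```

For the capacities in the list (192, 256, 384 and 512 GB) this returns Q2_K, Q3_K_M, Q6_K and Q8_0 respectively, matching the entries above.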
Notes
128 experts, 2 active. Realistically needs a multi-GPU server.
Frequently asked questions
- What are the VRAM requirements for Llama 4 Maverick 400B?
- Llama 4 Maverick 400B requires approximately 256.7 GB of VRAM at Q4_K_M quantization, 452.5 GB at Q8_0, and 900.5 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length.
- How many parameters does Llama 4 Maverick 400B have?
- Llama 4 Maverick 400B has 400 billion total parameters, but only 17 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests.
- How capable is Llama 4 Maverick 400B?
- Llama 4 Maverick 400B achieves an MMLU-Pro score of 80.5, placing it among the most capable open-weight models available — competitive with frontier systems on general knowledge and reasoning.
- Can Llama 4 Maverick 400B run on a 16 GB GPU?
- No. At Q4_K_M, Llama 4 Maverick 400B needs 256.7 GB of VRAM — more than 16 GB. You will need a multi-GPU server.
- Can Llama 4 Maverick 400B run on a 24 GB GPU?
- No. Even at Q4_K_M, Llama 4 Maverick 400B needs 256.7 GB. Consider a multi-GPU server with at least 256.7 GB of total VRAM.
- What is the smallest quantization for Llama 4 Maverick 400B that fits in 24 GB of VRAM?
- Llama 4 Maverick 400B cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 151.9 GB at Q2_K.
- What GPU do I need to run Llama 4 Maverick 400B locally?
- You need a multi-GPU server. At Q4_K_M, Llama 4 Maverick 400B needs 256.7 GB of VRAM, more than any single consumer GPU offers. Consider 4× 80 GB H100 or A100 GPUs (320 GB total).
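As a rough sizing aid, the number of GPUs follows from dividing the required VRAM by the per-GPU capacity and rounding up. The sketch below assumes 80 GB cards and ignores any per-GPU overhead from splitting the model across devices, so treat the result as a lower bound. The function name is hypothetical.

```python
import math


def gpus_needed(total_vram_gb: float, per_gpu_gb: float = 80.0) -> int:
    """Lower bound on GPU count: required VRAM divided by per-GPU capacity, rounded up.

    Assumption: memory splits evenly across GPUs with no extra per-device overhead.
    """
    return math.ceil(total_vram_gb / per_gpu_gb)


if __name__ == "__main__":
    # Totals at 8k context, taken from the VRAM table on this page.
    for quant, total in (("Q2_K", 151.9), ("Q4_K_M", 256.7), ("Q8_0", 452.5), ("FP16", 900.5)):
        print(f"{quant}: at least {gpus_needed(total)} x 80 GB GPUs")
```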