Llama 4 Scout 109B
Llama 4 Scout 109B needs roughly 54.5 GB of VRAM for its weights at Q4 quantization (218.0 GB at FP16). 24 of the GPUs we track can run it fully in VRAM at 8k context.
Meta · 109B params · 17B active (MoE) · 9766k context · Llama 4 Community · Commercial use ok
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP16 | 218.0 GB | 2.68 GB | 247.2 GB |
| Q8 | 109.0 GB | 2.68 GB | 125.1 GB |
| Q6_K | 81.8 GB | 2.68 GB | 94.6 GB |
| Q5_K_M | 68.1 GB | 2.68 GB | 79.3 GB |
| Q4_K_M | 54.5 GB | 2.68 GB | 64.0 GB |
| Q3_K_M | 43.6 GB | 2.68 GB | 51.8 GB |
| Q2_K | 32.7 GB | 2.68 GB | 39.6 GB |
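
The rows above can be reproduced with simple arithmetic. The sketch below is an illustration only: the bytes-per-weight values are inferred from the Weights column (for example 0.5 bytes per parameter for Q4_K_M), and the 12% overhead multiplier is inferred from the Total column, which is larger than Weights plus KV cache in every row. Neither figure is stated by the source, and the names `BYTES_PER_WEIGHT`, `OVERHEAD`, `weights_gb`, and `total_gb` are hypothetical.

```python
# Sketch: reproduce the VRAM table from the parameter count.
# BYTES_PER_WEIGHT and OVERHEAD are inferred from the table rows,
# not taken from any documented formula.

PARAMS_B = 109.0     # total parameters, in billions
KV_CACHE_GB = 2.68   # KV cache at 8k context, from the table
OVERHEAD = 1.12      # assumed multiplier; consistent with every Total row

BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8": 1.0,
    "Q6_K": 0.75,
    "Q5_K_M": 0.625,
    "Q4_K_M": 0.5,
    "Q3_K_M": 0.4,
    "Q2_K": 0.3,
}


def weights_gb(quant: str) -> float:
    """Weight memory in GB: billions of parameters times bytes per weight."""
    return PARAMS_B * BYTES_PER_WEIGHT[quant]


def total_gb(quant: str, kv_cache_gb: float = KV_CACHE_GB) -> float:
    """Weights plus KV cache, scaled by the assumed overhead multiplier."""
    return (weights_gb(quant) + kv_cache_gb) * OVERHEAD


if __name__ == "__main__":
    for quant in BYTES_PER_WEIGHT:
        print(f"{quant:7s} weights={weights_gb(quant):6.1f} GB  "
              f"total={total_gb(quant):6.1f} GB")
```

These are nominal sizes. Real quantized files generally deviate from them, because K-quant formats mix precisions across tensors.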
Benchmarks
GPUs that run Llama 4 Scout 109B natively (24)
- NVIDIA H100 80GB · Q4_K_M · 433.5 t/s
- NVIDIA A100 80GB · Q4_K_M · 263.9 t/s
- NVIDIA L40S · Q2_K · 186.4 t/s
- NVIDIA RTX A6000 · Q2_K · 165.6 t/s
- NVIDIA RTX 6000 Ada · Q2_K · 207.1 t/s
- AMD Instinct MI300X · Q8 · 342.9 t/s
- Apple M4 Ultra (384GB) · FP16 · 35.3 t/s
- Apple M4 Ultra (192GB) · Q8 · 70.7 t/s
- Apple M4 Max (128GB) · Q6_K · 47.1 t/s
- Apple M4 Max (96GB) · Q5_K_M · 56.5 t/s
- Apple M4 Max (64GB) · Q3_K_M · 88.3 t/s
- Apple M4 Max (48GB) · Q2_K · 117.8 t/s
- Apple M4 Pro (48GB) · Q2_K · 58.9 t/s
- Apple M3 Max (128GB) · Q6_K · 34.5 t/s
- Apple M3 Max (96GB) · Q5_K_M · 41.4 t/s
- Apple M3 Max (64GB) · Q3_K_M · 64.7 t/s
- Apple M3 Max (48GB) · Q2_K · 86.3 t/s
- Apple M2 Ultra (384GB) · FP16 · 25.9 t/s
- Apple M2 Ultra (192GB) · Q8 · 51.8 t/s
- Apple M2 Max (96GB) · Q5_K_M · 41.4 t/s
- Apple M2 Max (64GB) · Q3_K_M · 64.7 t/s
- Apple M1 Ultra (128GB) · Q6_K · 69.0 t/s
- Apple M1 Ultra (64GB) · Q3_K_M · 129.4 t/s
- Apple M1 Max (64GB) · Q3_K_M · 64.7 t/s
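
The quantization listed next to each device appears to be the highest-precision one whose 8k-context total fits in memory with some room to spare. A minimal sketch of that selection follows, assuming a 10% headroom. That threshold is a guess that happens to be consistent with the listings here (an 80 GB card gets Q4_K_M rather than Q5_K_M at 79.3 GB), not a rule stated by the source. `HEADROOM` and `best_quant` are hypothetical names.

```python
# Sketch: choose the highest-precision quantization that fits in memory.
# HEADROOM = 0.9 is an assumption, not a rule stated by the source.
from typing import Optional

# (quant, total GB at 8k context) from the table, highest precision first
TOTAL_GB_AT_8K = [
    ("FP16", 247.2),
    ("Q8", 125.1),
    ("Q6_K", 94.6),
    ("Q5_K_M", 79.3),
    ("Q4_K_M", 64.0),
    ("Q3_K_M", 51.8),
    ("Q2_K", 39.6),
]

HEADROOM = 0.9  # assumed usable fraction of device memory


def best_quant(memory_gb: float, headroom: float = HEADROOM) -> Optional[str]:
    """Return the first quant whose total fits the budget, or None."""
    budget = memory_gb * headroom
    for quant, total in TOTAL_GB_AT_8K:
        if total <= budget:
            return quant
    return None  # nothing fits fully in memory; offload would be needed


if __name__ == "__main__":
    for memory_gb in (384, 192, 128, 96, 80, 64, 48, 40):
        print(f"{memory_gb:4d} GB -> {best_quant(memory_gb)}")
```

Under this assumption a 40 GB device gets no native fit, even though Q2_K's 39.6 GB total is nominally below 40 GB.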
Plus 11 GPUs that run it with CPU offload (slower)
- NVIDIA RTX 5090 · Q3_K_M · 65.9 t/s
- NVIDIA RTX 4090 · Q2_K · 49.4 t/s
- NVIDIA RTX 4080 · Q2_K · 35.1 t/s
- NVIDIA RTX 4060 Ti 16GB · Q2_K · 14.1 t/s
- NVIDIA RTX 3090 · Q2_K · 45.9 t/s
- NVIDIA RTX 3090 Ti · Q2_K · 49.4 t/s
- NVIDIA A100 40GB · Q3_K_M · 57.2 t/s
- AMD Radeon RX 7900 XTX · Q2_K · 47.1 t/s
- AMD Radeon RX 7900 XT · Q2_K · 39.2 t/s
- AMD Radeon RX 6800 XT · Q2_K · 25.1 t/s
- Intel Arc A770 16GB · Q2_K · 27.5 t/s
Notes
Mixture-of-experts model with 16 experts, 2 active per token. The advertised context window is 10M tokens, but KV cache memory limits practical context to far less.
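
To see why the KV cache caps practical context, the 8k figure from the table can be scaled linearly. This is a rough sketch that assumes strictly linear growth from the 2.68 GB baseline and ignores any architecture-specific attention optimizations or KV cache quantization. `kv_cache_gb` is a hypothetical helper, and context is expressed in the page's own "k" units.

```python
# Sketch: scale the 8k-context KV cache figure linearly with context length.
# Assumes strict linear growth; real runtimes may use less.

KV_GB_AT_8K = 2.68   # KV cache at 8k context, from the table
BASE_CONTEXT_K = 8   # baseline context, in thousands of tokens


def kv_cache_gb(context_k: float) -> float:
    """Estimated KV cache size in GB for a context of `context_k` thousand tokens."""
    return KV_GB_AT_8K * context_k / BASE_CONTEXT_K


if __name__ == "__main__":
    for context_k in (8, 32, 128, 1024, 9766):
        print(f"{context_k:5d}k context -> {kv_cache_gb(context_k):8.1f} GB KV cache")
```

Under this linear assumption, the full 9766k window would need roughly 3,270 GB of KV cache alone, far beyond any single device listed above.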
