Llama 3.1 405B Instruct
Llama 3.1 405B Instruct needs roughly 260.1 GB VRAM at Q4_K_M quantization (911.9 GB at FP16). 7 GPUs we track can run it fully in VRAM at 8k context.
7 GPUs run this natively · 0 with CPU offload
Llama 3.1 405B Instruct is a dense 405B-parameter model developed by Meta and released in July 2024. It is the largest and most capable model in the Llama 3.1 family. With 405B parameters and a 128K context window, it is roughly on par with GPT-4 on benchmark evaluations.
Llama 3.1 405B Instruct is not realistically runnable on local hardware: even Q4_K_M quantization needs about 260 GB of VRAM at 8k context (228 GB for the weights alone). It is best accessed via an API or cloud inference. Q2_K (about 154 GB) is the only option for extreme workstation builds.
MMLU ~89%, GPQA Diamond ~71%, and strong HumanEval performance make it frontier-class for STEM and multi-step reasoning.
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 1620.0 GB | 4.23 GB | 1819.1 GB |
| BF16 | 810.0 GB | 4.23 GB | 911.9 GB |
| FP16 | 810.0 GB | 4.23 GB | 911.9 GB |
| Q8_0 | 405.0 GB | 4.23 GB | 458.3 GB |
| Q6_K | 332.1 GB | 4.23 GB | 376.7 GB |
| Q5_K_M | 260.8 GB | 4.23 GB | 296.9 GB |
| Q4_K_M (recommended) | 228.0 GB | 4.23 GB | 260.1 GB |
| Q3_K_M | 174.2 GB | 4.23 GB | 199.8 GB |
| Q2_K | 133.3 GB | 4.23 GB | 154.0 GB |
| NVFP4 (CUDA only) | 202.5 GB | 4.23 GB | 231.5 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
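The KV cache figure and the totals in the table can be reproduced with a short calculation. The sketch below assumes Llama 3.1 405B's published architecture (126 layers, 8 KV heads under grouped-query attention, head dimension 128) and decimal gigabytes. The 12% overhead factor is not stated on this page; it is inferred from the table, where each total is close to (weights + KV cache) × 1.12. The function names are illustrative.

```python
# Architecture values for Llama 3.1 405B (assumed from the published model config).
N_LAYERS = 126
N_KV_HEADS = 8   # grouped-query attention
HEAD_DIM = 128

# Assumption: inferred from the table on this page, not documented by the source.
OVERHEAD = 1.12


def kv_cache_gb(context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in decimal GB; 2 bytes per value corresponds to FP16."""
    # The factor of 2 covers the separate key and value tensors in every layer.
    per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


def total_vram_gb(weights_gb: float, context_tokens: int = 8192) -> float:
    """Weights plus KV cache, scaled by the assumed runtime overhead."""
    return (weights_gb + kv_cache_gb(context_tokens)) * OVERHEAD


if __name__ == "__main__":
    print(f"KV cache at 8k:     {kv_cache_gb(8192):.2f} GB")
    print(f"KV cache at 128k:   {kv_cache_gb(131072):.2f} GB")
    print(f"Q4_K_M total at 8k: {total_vram_gb(228.0):.1f} GB")
```

Under these assumptions, the KV cache at the full 128K context comes to roughly 68 GB at FP16, so long-context use adds noticeably to the totals listed for 8k.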
Benchmarks
GPUs that run Llama 3.1 405B Instruct natively (7)
- AMD Instinct MI300X: Q2_K · 39.8 t/s
- Apple M4 Ultra (384GB): Q6_K · 3.3 t/s
- Apple M4 Ultra (192GB): Q2_K · 8.2 t/s
- Apple M3 Ultra (512GB): Q8_0 · 2 t/s
- Apple M3 Ultra (256GB): Q3_K_M · 4.7 t/s
- Apple M2 Ultra (384GB): Q6_K · 2.4 t/s
- Apple M2 Ultra (192GB): Q2_K · 6 t/s
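The throughput figures in this list look like memory-bandwidth-bound estimates rather than measured results: for a dense model, each generated token requires reading all weights once, so decode speed is bounded by memory bandwidth divided by weight size. This is an inference from the numbers, not something the page states. The sketch below assumes vendor peak bandwidths of 5,300 GB/s for the MI300X and 819 GB/s for the M3 Ultra; the function name is illustrative, and real-world throughput will generally be lower.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper-bound decode speed for a dense model.

    Assumes every generated token reads the full set of weights from
    memory once and that nothing else limits throughput.
    """
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    # MI300X at Q2_K (133.3 GB of weights), assumed 5,300 GB/s peak bandwidth.
    print(round(decode_tokens_per_second(5300, 133.3), 1))
    # M3 Ultra at Q8_0 (405.0 GB of weights), assumed 819 GB/s peak bandwidth.
    print(round(decode_tokens_per_second(819, 405.0), 1))
```

The estimate ignores KV cache reads, compute limits, and multi-device communication, so it is best read as a ceiling.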
Notes
Frontier-class open weights. Realistically needs a multi-GPU server.
Frequently asked questions
- What are the VRAM requirements for Llama 3.1 405B Instruct?
- Llama 3.1 405B Instruct requires approximately 260.1 GB of VRAM at Q4_K_M quantization, 458.3 GB at Q8_0, and 911.9 GB at FP16. These numbers assume an 8k context window; the KV cache portion of the total grows linearly with context length.
- How many parameters does Llama 3.1 405B Instruct have?
- Llama 3.1 405B Instruct has 405 billion parameters.
- How capable is Llama 3.1 405B Instruct?
- Llama 3.1 405B Instruct achieves an MMLU-Pro score of 73.3, placing it among the most capable open-weight models available — competitive with frontier systems on general knowledge and reasoning.
- Can Llama 3.1 405B Instruct run on a 16 GB GPU?
- No. At Q4_K_M, Llama 3.1 405B Instruct needs 260.1 GB of VRAM, far more than 16 GB. You will need a multi-GPU server.
- Can Llama 3.1 405B Instruct run on a 24 GB GPU?
- No. At Q4_K_M, Llama 3.1 405B Instruct needs 260.1 GB, and even Q2_K needs 154.0 GB. Consider a multi-GPU server with more than 260 GB of total VRAM.
- What is the smallest quantization for Llama 3.1 405B Instruct that fits in 24 GB of VRAM?
- Llama 3.1 405B Instruct cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 154.0 GB at Q2_K.
- What GPU do I need to run Llama 3.1 405B Instruct locally?
- You need a multi-GPU server. At Q4_K_M, Llama 3.1 405B Instruct needs 260.1 GB of VRAM, more than any single consumer GPU provides. Consider at least 4× 80 GB H100 or A100 GPUs (320 GB total).
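The fit questions above all reduce to a lookup against the Total column of the VRAM table. As a minimal sketch, the hypothetical helpers below pick the largest quantization that fits a given memory budget and count how many equal-sized GPUs a quantization needs. They assume all listed memory is usable and ignore any extra overhead from splitting the model across devices.

```python
import math

# Total VRAM in GB at 8k context, copied from the table on this page.
TOTAL_GB = {
    "FP16": 911.9,
    "Q8_0": 458.3,
    "Q6_K": 376.7,
    "Q5_K_M": 296.9,
    "Q4_K_M": 260.1,
    "Q3_K_M": 199.8,
    "Q2_K": 154.0,
}


def best_quant(available_gb: float):
    """Return the largest quantization whose total fits, or None if none do."""
    fitting = {q: gb for q, gb in TOTAL_GB.items() if gb <= available_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


def gpus_needed(quant: str, gb_per_gpu: float) -> int:
    """Minimum number of equal-sized GPUs whose combined memory covers the total."""
    return math.ceil(TOTAL_GB[quant] / gb_per_gpu)


if __name__ == "__main__":
    print(best_quant(24))             # no quantization fits in 24 GB
    print(best_quant(256))            # largest fit for a 256 GB device
    print(gpus_needed("Q4_K_M", 80))  # 80 GB GPUs needed for Q4_K_M
```

With the table's figures, a 24 GB card fits nothing, a 256 GB device tops out at Q3_K_M, and Q4_K_M needs four 80 GB GPUs.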