
Llama 3.1 405B Instruct

Llama 3.1 405B Instruct needs roughly 260.1 GB VRAM at Q4_K_M quantization (911.9 GB at FP16). 7 GPUs we track can run it fully in VRAM at 8k context.

7 GPUs run this natively · 0 with CPU offload

Meta · 405B params · 128k context · Llama 3.1 Community · Commercial use ok

Llama 3.1 405B Instruct is a 405B-parameter dense model developed by Meta. Released in July 2024, it was Meta's largest and most capable model at launch. With 405B parameters and a 128K context window, it is reported to be competitive with GPT-4 on benchmark evaluations.

Running Llama 3.1 405B Instruct locally is not realistic for most setups: even Q4_K_M quantization needs about 260 GB of VRAM at 8k context. It is best accessed via an API or cloud inference. Q2_K (about 154 GB) is the only option for extreme workstation builds.

MMLU ~89%, GPQA Diamond ~71%, and strong HumanEval performance make it frontier-class for STEM and multi-step reasoning.

VRAM at each quantization

Assumes 8k context. KV cache grows linearly with context length.
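The 4.23 GB KV cache figure listed on this page for 8k context can be reproduced from the model's attention shape. This is a minimal sketch, assuming the published Llama 3.1 405B configuration (126 layers, 8 KV heads with grouped-query attention, head dimension 128), an 8k context of 8192 tokens, FP16 cache values, and decimal gigabytes. The function name is illustrative and is not part of the calculator.

```python
def kv_cache_gb(context_tokens, n_layers=126, n_kv_heads=8,
                head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in decimal GB for a single sequence."""
    # Each token stores one key vector and one value vector
    # per layer and per KV head (hence the leading factor of 2).
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / 1e9


for ctx in (8_192, 32_768, 131_072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(ctx):.2f} GB")
```

Under these assumptions the cache is about 4.23 GB at 8192 tokens and grows in direct proportion to context length, reaching roughly 67.6 GB at the full 131072-token window.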

Quant                   Weights      KV cache    Total
FP32                    1620.0 GB    4.23 GB     1819.1 GB
BF16                    810.0 GB     4.23 GB     911.9 GB
FP16                    810.0 GB     4.23 GB     911.9 GB
Q8_0                    405.0 GB     4.23 GB     458.3 GB
Q6_K                    332.1 GB     4.23 GB     376.7 GB
Q5_K_M                  260.8 GB     4.23 GB     296.9 GB
Q4_K_M (recommended)    228.0 GB     4.23 GB     260.1 GB
Q3_K_M                  174.2 GB     4.23 GB     199.8 GB
Q2_K                    133.3 GB     4.23 GB     154.0 GB
NVFP4 (CUDA only)       202.5 GB     4.23 GB     231.5 GB

KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
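The totals in the table are larger than weights plus KV cache alone. They appear consistent with a flat overhead of about 12% applied to both, which is an inference from the numbers on this page and not a documented formula. The sketch below works under that assumption, with weights computed as parameters × bits per weight / 8. The bit widths are the nominal values implied by the table's weight column, and the function names are illustrative.

```python
PARAMS = 405e9        # parameter count from this page
KV_CACHE_GB = 4.23    # 8k context, FP16 KV cache, from the table
OVERHEAD = 1.12       # ASSUMPTION: inferred from the table, not stated by the source

# Nominal bits per weight implied by the table's weight sizes.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "Q8_0": 8, "NVFP4": 4}


def weights_gb(bits):
    """Weight storage in decimal GB for a given bit width."""
    return PARAMS * bits / 8 / 1e9


def total_vram_gb(bits, kv_gb=KV_CACHE_GB, overhead=OVERHEAD):
    """Weights plus KV cache, scaled by the assumed overhead factor."""
    return (weights_gb(bits) + kv_gb) * overhead


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:<6} weights {weights_gb(bits):7.1f} GB  "
          f"total {total_vram_gb(bits):7.1f} GB")
```

The K-quant rows (Q6_K through Q2_K) use mixed block formats, so their effective bits per weight are not whole numbers and are left out of this sketch.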

Benchmarks

GPUs that run Llama 3.1 405B Instruct natively (7)

Notes

Frontier-class open weights. Realistically needs a multi-GPU server.

Hugging Face · Released 2024-07-23

Frequently asked questions

What are the VRAM requirements for Llama 3.1 405B Instruct?
Llama 3.1 405B Instruct requires approximately 260.1 GB of VRAM at Q4_K_M quantization, 458.3 GB at Q8_0, and 911.9 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length, so longer contexts need more VRAM.
How many parameters does Llama 3.1 405B Instruct have?
Llama 3.1 405B Instruct has 405 billion parameters.
How capable is Llama 3.1 405B Instruct?
Llama 3.1 405B Instruct achieves an MMLU-Pro score of 73.3, placing it among the most capable open-weight models available — competitive with frontier systems on general knowledge and reasoning.
Can Llama 3.1 405B Instruct run on a 16 GB GPU?
No. At Q4_K_M, Llama 3.1 405B Instruct needs 260.1 GB of VRAM — more than 16 GB. You will need a multi-GPU server.
Can Llama 3.1 405B Instruct run on a 24 GB GPU?
No. Even at Q4_K_M, Llama 3.1 405B Instruct needs 260.1 GB. Consider a multi-GPU server with more than 260 GB of total VRAM.
What is the smallest quantization for Llama 3.1 405B Instruct that fits in 24 GB of VRAM?
Llama 3.1 405B Instruct cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 154.0 GB at Q2_K.
What GPU do I need to run Llama 3.1 405B Instruct locally?
You need a multi-GPU server. At Q4_K_M, Llama 3.1 405B Instruct needs 260.1 GB VRAM, more than any single consumer GPU. Consider at least 4× 80 GB H100 or A100 GPUs; two or three 80 GB cards do not provide enough total VRAM at this quantization.
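The answers above all come from comparing a VRAM budget against the 8k-context totals listed on this page. A minimal sketch of that comparison follows, assuming identical GPUs whose memory pools perfectly (real multi-GPU setups lose some capacity to per-device overhead). The helper names are illustrative.

```python
import math

# Total VRAM in GB at 8k context, copied from the table on this page.
TOTAL_VRAM_GB = {
    "FP16": 911.9, "Q8_0": 458.3, "Q6_K": 376.7, "Q5_K_M": 296.9,
    "Q4_K_M": 260.1, "Q3_K_M": 199.8, "Q2_K": 154.0,
}


def best_quant_that_fits(budget_gb):
    """Return the largest listed quantization that fits, or None."""
    fitting = {q: gb for q, gb in TOTAL_VRAM_GB.items() if gb <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None


def gpus_needed(quant, vram_per_gpu_gb):
    """Minimum GPU count, ASSUMING memory pools perfectly across devices."""
    return math.ceil(TOTAL_VRAM_GB[quant] / vram_per_gpu_gb)


print(best_quant_that_fits(24))      # None: nothing fits in 24 GB
print(gpus_needed("Q4_K_M", 80))     # 4
print(best_quant_that_fits(8 * 80))  # Q8_0
```

Under these assumptions, Q4_K_M needs at least four 80 GB GPUs and Q2_K needs at least two.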