DeepSeek V3 671B
DeepSeek V3 671B needs roughly 423.7 GB of VRAM at Q4_K_M quantization (1503.6 GB at FP16). Four of the GPUs we track can run it fully in VRAM at 8k context.
4 GPUs run this natively · 0 with CPU offload
DeepSeek V3 671B is a Mixture of Experts (MoE) model developed by DeepSeek, with 671B total parameters but only 37B active per token. Released in December 2024, it uses Multi-Head Latent Attention (MLA), which gives a KV cache about 16× smaller than standard GQA.
To run DeepSeek V3 671B locally: the full model needs about 424 GB at Q4_K_M. Q2_K (about 248 GB) is the only local option, and only for extreme builds. MLA shrinks the KV cache but doesn't solve the VRAM problem, because the weights dominate the total. As a MoE model, inference speed depends on active parameters (37B) rather than total size.
MMLU-Pro 75.9%, Math 90.2%: frontier-class quality with efficient attention.
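The point about active parameters can be made concrete with a common rule of thumb: during decoding, each token requires reading the active weights from memory once, so memory bandwidth divided by the bytes of active weights gives a rough upper bound on tokens per second. The sketch below is illustrative only. The 800 GB/s bandwidth is a hypothetical value, and 4.5 bits per weight is an approximation of Q4_K_M derived from the table (377.8 GB for 671B parameters); real throughput is lower because of compute, KV cache reads, and expert routing.

```python
def decode_tokens_per_second(active_params: float,
                             bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Rough bandwidth-bound ceiling on decode speed (rule of thumb, not a benchmark)."""
    bytes_per_token = active_params * bits_per_weight / 8  # weights read per token
    return bandwidth_gb_s * 1e9 / bytes_per_token

BANDWIDTH_GB_S = 800.0   # HYPOTHETICAL memory bandwidth
BITS_PER_WEIGHT = 4.5    # ASSUMPTION: approximates Q4_K_M

# MoE: only the 37B active parameters are read per token.
moe = decode_tokens_per_second(37e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
# Hypothetical dense model of the same total size, for comparison.
dense = decode_tokens_per_second(671e9, BITS_PER_WEIGHT, BANDWIDTH_GB_S)

print(f"MoE ceiling:   {moe:.1f} tokens/s")    # about 38.4
print(f"Dense ceiling: {dense:.1f} tokens/s")  # about 2.1
```

Under these assumptions the MoE ceiling is about 18× the dense one (671 / 37), which is why total size determines whether the model fits while active size determines how fast it runs.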
VRAM at each quantization
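Assumes 8k context. KV cache grows linearly with context length. Total is larger than Weights plus KV cache; the difference (about 12% in every row) is presumably runtime overhead.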
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 2684.0 GB | 0.51 GB | 3006.7 GB |
| BF16 | 1342.0 GB | 0.51 GB | 1503.6 GB |
| FP16 | 1342.0 GB | 0.51 GB | 1503.6 GB |
| Q8_0 | 671.0 GB | 0.51 GB | 752.1 GB |
| Q6_K | 550.2 GB | 0.51 GB | 616.8 GB |
| Q5_K_M | 432.1 GB | 0.51 GB | 484.6 GB |
| Q4_K_M | 377.8 GB | 0.51 GB | 423.7 GB |
| Q3_K_M | 288.5 GB | 0.51 GB | 323.7 GB |
| Q2_K (recommended) | 220.8 GB | 0.51 GB | 247.8 GB |
| NVFP4 (CUDA) | 335.5 GB | 0.51 GB | 376.3 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU.
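The table's columns can be approximated with a short calculation. This is a minimal sketch, not the calculator's actual code: the weight size follows from parameter count and bits per weight, and the 1.12 overhead factor is an assumption inferred from the table itself (every Total is about 1.12 × (Weights + KV cache)), not something the page states.

```python
PARAMS = 671e9       # total parameters (from the text)
KV_CACHE_GB = 0.51   # KV cache at 8k context, FP16 (from the table)
OVERHEAD = 1.12      # ASSUMPTION: inferred from the table, not documented

def weights_gb(bits_per_weight: float, params: float = PARAMS) -> float:
    """Weight storage in GB, using 1 GB = 1e9 bytes."""
    return params * bits_per_weight / 8 / 1e9

def total_gb(weights: float,
             kv_cache: float = KV_CACHE_GB,
             overhead: float = OVERHEAD) -> float:
    """Estimated total VRAM: weights plus KV cache, scaled by the overhead factor."""
    return (weights + kv_cache) * overhead

print(round(weights_gb(16), 1))    # FP16 weights: 1342.0
print(round(total_gb(1342.0), 1))  # FP16 total:   1503.6
print(round(total_gb(377.8), 1))   # Q4_K_M total: 423.7
```

The K-quant rows use fractional bits per weight (for example, 377.8 GB at Q4_K_M works out to roughly 4.5 bits), so their weight sizes are taken from the table rather than computed from a nominal bit width.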
Benchmarks
GPUs that run DeepSeek V3 671B natively (4)
- Apple M4 Ultra (384GB) · Q3_K_M · 75.5 t/s
- Apple M3 Ultra (512GB) · Q5_K_M · 37.8 t/s
- Apple M3 Ultra (256GB) · Q2_K · 74 t/s
- Apple M2 Ultra (384GB) · Q3_K_M · 55.3 t/s
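The quantization shown for each device above matches a simple rule: pick the largest quantization whose Total fits in the device's memory. A minimal sketch of that rule follows, using the totals from the VRAM table. It leaves out NVFP4 because that format requires a CUDA GPU, and it assumes all of the stated memory is usable by the model, which is a simplification.

```python
from typing import Optional

# Total VRAM in GB at 8k context, copied from the table on this page.
TOTAL_GB = {
    "FP16": 1503.6,
    "Q8_0": 752.1,
    "Q6_K": 616.8,
    "Q5_K_M": 484.6,
    "Q4_K_M": 423.7,
    "Q3_K_M": 323.7,
    "Q2_K": 247.8,
}

def best_quant(memory_gb: float) -> Optional[str]:
    """Return the largest listed quantization that fits, or None if none does."""
    fitting = {name: gb for name, gb in TOTAL_GB.items() if gb <= memory_gb}
    return max(fitting, key=fitting.get) if fitting else None

for memory in (512, 384, 256, 24):
    print(memory, best_quant(memory))
# 512 -> Q5_K_M, 384 -> Q3_K_M, 256 -> Q2_K, 24 -> None
```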
Notes
MoE with MLA attention — KV cache is ~16× smaller than standard GQA; kvHeads/headDim here approximates MLA storage.
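Because the KV cache grows linearly with context length, the 8k figure from the table can be scaled to other context sizes. The sketch below assumes "8k" means 8192 tokens and that nothing else about the cache changes with context. The GQA comparison simply applies the ~16× ratio quoted in the note above.

```python
KV_GB_AT_8K = 0.51    # FP16 KV cache at 8k context (from the table)
BASE_CONTEXT = 8192   # ASSUMPTION: "8k" means 8192 tokens
MLA_REDUCTION = 16    # ~16x smaller than standard GQA (from the note above)

def kv_cache_gb(context_tokens: int) -> float:
    """Linearly scale the 8k KV cache figure to another context length."""
    return KV_GB_AT_8K * context_tokens / BASE_CONTEXT

for ctx in (8192, 32768, 131072):
    mla = kv_cache_gb(ctx)
    gqa_equivalent = mla * MLA_REDUCTION  # what a GQA-style cache would need
    print(f"{ctx:>7} tokens: MLA {mla:.2f} GB, GQA-equivalent {gqa_equivalent:.2f} GB")
```

Even at 128k context the estimated MLA cache stays under 10 GB, small next to the hundreds of GB needed for weights.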
Frequently asked questions
- What are the VRAM requirements for DeepSeek V3 671B?
- DeepSeek V3 671B requires approximately 423.7 GB of VRAM at Q4_K_M quantization, 752.1 GB at Q8_0, and 1503.6 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length.
- How many parameters does DeepSeek V3 671B have?
- DeepSeek V3 671B has 671 billion total parameters, but only 37 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests.
- How capable is DeepSeek V3 671B?
- DeepSeek V3 671B achieves an MMLU-Pro score of 75.9, placing it among the most capable open-weight models available — competitive with frontier systems on general knowledge and reasoning.
- Can DeepSeek V3 671B run on a 16 GB GPU?
- No. At Q4_K_M, DeepSeek V3 671B needs 423.7 GB of VRAM, far more than 16 GB, and even Q2_K needs 247.8 GB. You will need a multi-GPU server.
- Can DeepSeek V3 671B run on a 24 GB GPU?
- No. Even at Q4_K_M, DeepSeek V3 671B needs 423.7 GB. Consider a multi-GPU server with at least 424 GB of total VRAM.
- What is the smallest quantization for DeepSeek V3 671B that fits in 24 GB of VRAM?
- DeepSeek V3 671B cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 247.8 GB at Q2_K.
- What GPU do I need to run DeepSeek V3 671B locally?
- You need a multi-GPU server. At Q4_K_M, DeepSeek V3 671B needs 423.7 GB of VRAM, more than any single consumer GPU. With 80 GB H100 or A100 GPUs, that means at least six cards (480 GB total); four would provide only 320 GB.
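The GPU counts implied by the answers above follow from dividing the required VRAM by the memory per card and rounding up. This is a rough sketch: it assumes the model splits evenly across cards and ignores any extra per-GPU overhead from tensor or pipeline parallelism, so treat the results as lower bounds.

```python
import math

def gpus_needed(total_vram_gb: float, per_gpu_gb: float = 80.0) -> int:
    """Minimum number of GPUs whose combined VRAM covers the requirement."""
    return math.ceil(total_vram_gb / per_gpu_gb)

# Totals at 8k context, taken from the VRAM table on this page.
for quant, total in (("Q2_K", 247.8), ("Q4_K_M", 423.7), ("Q8_0", 752.1)):
    print(f"{quant}: {gpus_needed(total)} x 80 GB GPUs")
# Q2_K: 4, Q4_K_M: 6, Q8_0: 10
```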