DeepSeek R1 Distill Llama 70B
DeepSeek R1 Distill Llama 70B needs roughly 47.1 GB of VRAM at Q4_K_M quantization (159.8 GB at FP16). Of the GPUs we track, 47 can run it fully in VRAM at 8k context.
47 GPUs run this natively · 35 with CPU offload
DeepSeek R1 Distill Llama 70B is a 70B-parameter dense model developed by DeepSeek. It distills DeepSeek-R1's reasoning capabilities into the Llama-3.3 architecture.
To run DeepSeek R1 Distill Llama 70B locally, plan for roughly 35-40 GB of Q4_K_M weights (about 47 GB in total at 8k context), the same requirements as Llama-3.3-70B. It is a practical way to get R1-style reasoning locally.
Benchmark scores: MMLU-Pro 70.0%, GPQA 65.2%, MATH 94.5%. The model inherits R1's reasoning strength at a practical size.
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 280.0 GB | 2.68 GB | 316.6 GB |
| BF16 | 140.0 GB | 2.68 GB | 159.8 GB |
| FP16 | 140.0 GB | 2.68 GB | 159.8 GB |
| Q8_0 | 70.0 GB | 2.68 GB | 81.4 GB |
| Q6_K | 57.4 GB | 2.68 GB | 67.3 GB |
| Q5_K_M | 45.1 GB | 2.68 GB | 53.5 GB |
| Q4_K_M (recommended) | 39.4 GB | 2.68 GB | 47.1 GB |
| Q3_K_M | 30.1 GB | 2.68 GB | 36.7 GB |
| Q2_K | 23.0 GB | 2.68 GB | 28.8 GB |
| NVFP4 (CUDA) | 35.0 GB | 2.68 GB | 42.2 GB |
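KV cache is shown at 8k context (FP16). Totals run about 12% above weights plus KV cache, which suggests an allowance for runtime overhead. NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.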
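The table's figures can be approximated with a short calculation. The sketch below is an estimate under stated assumptions, not the calculator's actual method: the attention shape (80 layers, 8 KV heads, head dimension 128) is taken from the Llama 3 70B family rather than from this page, and the 1.12 overhead factor is inferred from the ratio of the Total column to weights plus KV cache.

```python
# Assumed Llama-3-70B-style attention shape (not stated on this page).
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # FP16 KV cache entries

# Overhead factor inferred from the table's Total column (an assumption).
OVERHEAD = 1.12


def kv_cache_gb(context_tokens: int) -> float:
    """KV cache size in decimal GB; grows linearly with context length."""
    # Factor of 2 covers keys and values, stored for every layer and KV head.
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return bytes_per_token * context_tokens / 1e9


def total_vram_gb(weights_gb: float, context_tokens: int = 8192) -> float:
    """Estimated total VRAM: weights plus KV cache, scaled by the overhead factor."""
    return (weights_gb + kv_cache_gb(context_tokens)) * OVERHEAD


print(round(kv_cache_gb(8192), 2))          # 2.68
print(round(total_vram_gb(39.4), 1))        # 47.1 (Q4_K_M at 8k context)
print(round(total_vram_gb(39.4, 32768), 1)) # Q4_K_M at 32k context
```

Quadrupling the context from 8k to 32k quadruples the KV cache term, while the weights term stays fixed.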
GPUs that run DeepSeek R1 Distill Llama 70B natively (47)
- NVIDIA RTX 5090 · Q2_K · 77.8 t/s
- NVIDIA H100 80GB · NVFP4 · 95.7 t/s
- NVIDIA A100 80GB · NVFP4 · 58.3 t/s
- NVIDIA A100 40GB · Q3_K_M · 51.7 t/s
- NVIDIA L40S · NVFP4 · 24.7 t/s
- NVIDIA RTX A6000 · NVFP4 · 21.9 t/s
- NVIDIA RTX 5000 Ada · Q2_K · 25 t/s
- NVIDIA RTX 6000 Ada · NVFP4 · 27.4 t/s
- NVIDIA RTX Pro 6000 · NVFP4 · 38.4 t/s
- NVIDIA DGX Spark (128GB) · NVFP4 · 7.8 t/s
- AMD Radeon PRO W7800 · Q2_K · 25 t/s
- AMD Radeon PRO W7900 · Q3_K_M · 28.7 t/s
- AMD Instinct MI300X · BF16 · 37.9 t/s
- AMD Radeon AI Pro 9700 32GB · Q2_K · 27.8 t/s
- AMD Strix Halo (128GB) · Q8_0 · 3.7 t/s
- AMD Strix Halo (96GB) · Q8_0 · 3.7 t/s
- AMD Strix Halo (64GB) · Q5_K_M · 5.7 t/s
- Apple M5 Max (128GB) · Q8_0 · 8.8 t/s
- Apple M5 Max (64GB) · Q5_K_M · 13.6 t/s
- Apple M5 Max (48GB) · Q3_K_M · 20.4 t/s
- Apple M5 Pro (48GB) · Q3_K_M · 10.2 t/s
- Apple M5 Pro (36GB) · Q2_K · 13.3 t/s
- Apple M4 Ultra (384GB) · FP32 · 3.9 t/s
- Apple M4 Ultra (192GB) · BF16 · 7.8 t/s
- Apple M4 Max (128GB) · Q8_0 · 7.8 t/s
- Apple M4 Max (96GB) · Q8_0 · 7.8 t/s
- Apple M4 Max (64GB) · Q5_K_M · 12.1 t/s
- Apple M4 Max (48GB) · Q3_K_M · 18.1 t/s
- Apple M4 Pro (48GB) · Q3_K_M · 9.1 t/s
- Apple M3 Ultra (512GB) · FP32 · 2.9 t/s
- Apple M3 Ultra (256GB) · BF16 · 5.9 t/s
- Apple M3 Ultra (96GB) · Q8_0 · 11.7 t/s
- Apple M3 Max (128GB) · Q8_0 · 5.7 t/s
- Apple M3 Max (96GB) · Q8_0 · 5.7 t/s
- Apple M3 Max (64GB) · Q5_K_M · 8.9 t/s
- Apple M3 Max (48GB) · Q3_K_M · 13.3 t/s
- Apple M3 Max (36GB) · Q2_K · 17.4 t/s
- Apple M3 Pro (36GB) · Q2_K · 6.5 t/s
- Apple M2 Ultra (384GB) · FP32 · 2.9 t/s
- Apple M2 Ultra (192GB) · BF16 · 5.7 t/s
- Apple M2 Max (96GB) · Q8_0 · 5.7 t/s
- Apple M2 Max (64GB) · Q5_K_M · 8.9 t/s
- Apple M1 Ultra (128GB) · Q8_0 · 11.4 t/s
- Apple M1 Ultra (64GB) · Q5_K_M · 17.7 t/s
- Apple M1 Max (64GB) · Q5_K_M · 8.9 t/s
- Intel Data Center GPU Max 1550 · Q8_0 · 46.8 t/s
- Intel Data Center GPU Max 1100 · Q3_K_M · 40.8 t/s
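The page does not say how these tokens-per-second figures are derived, but several of them are close to a simple memory-bandwidth-bound estimate: each generated token reads all the weights once, so decode speed is roughly memory bandwidth divided by weight size. The sketch below illustrates that rule of thumb. The bandwidth values are vendor-published peak figures and are assumptions here, not data from this page; real throughput is usually lower.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper-bound decode speed when generation is memory-bandwidth bound."""
    return bandwidth_gb_s / weights_gb


# (peak memory bandwidth in GB/s, weight size in GB from the quantization table)
# Bandwidths are published peak specs, assumed here for illustration.
examples = {
    "NVIDIA RTX 5090 at Q2_K": (1792.0, 23.0),
    "NVIDIA H100 80GB at NVFP4": (3350.0, 35.0),
    "NVIDIA A100 80GB at NVFP4": (2039.0, 35.0),
}

for name, (bandwidth, weights) in examples.items():
    estimate = decode_tokens_per_second(bandwidth, weights)
    print(f"{name}: ~{estimate:.1f} t/s")
```

This also explains why a smaller quantization on the same GPU yields a higher figure: fewer bytes are read per token.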
Plus 35 GPUs that run it with CPU offload (slower)
- NVIDIA RTX 5080 · Q3_K_M · 8 t/s
- NVIDIA RTX 5070 Ti · Q3_K_M · 7.4 t/s
- NVIDIA RTX 5070 · Q3_K_M · 5.6 t/s
- NVIDIA RTX 5060 Ti 16GB · Q3_K_M · 3.7 t/s
- NVIDIA RTX 5060 · Q2_K · 4.9 t/s
- NVIDIA RTX 5050 · Q2_K · 3.5 t/s
- NVIDIA RTX 4090 · NVFP4 · 7.2 t/s
- NVIDIA RTX 4080 · Q3_K_M · 6 t/s
- NVIDIA RTX 4070 Ti · Q3_K_M · 4.2 t/s
- NVIDIA RTX 4070 · Q3_K_M · 4.2 t/s
- NVIDIA RTX 4060 Ti 16GB · Q3_K_M · 2.4 t/s
- NVIDIA RTX 4060 · Q2_K · 3 t/s
- NVIDIA RTX 3090 · NVFP4 · 6.7 t/s
- NVIDIA RTX 3090 Ti · NVFP4 · 7.2 t/s
- NVIDIA RTX 3080 10GB · Q2_K · 8.3 t/s
- NVIDIA RTX 3060 12GB · Q3_K_M · 3 t/s
- NVIDIA RTX 4000 Ada · NVFP4 · 2.3 t/s
- NVIDIA RTX 4500 Ada · NVFP4 · 3.1 t/s
- AMD Radeon RX 7900 XTX · Q4_K_M · 6.1 t/s
- AMD Radeon RX 7900 XT · Q3_K_M · 6.6 t/s
- AMD Radeon RX 7900 GRE · Q3_K_M · 4.8 t/s
- AMD Radeon RX 6800 XT · Q3_K_M · 4.3 t/s
- Intel Arc B580 12GB · Q3_K_M · 3.8 t/s
- Intel Arc B570 10GB · Q2_K · 4.1 t/s
- Intel Arc Pro B70 24GB · Q4_K_M · 2.9 t/s
- Intel Arc Pro B60 24GB · Q4_K_M · 2.4 t/s
- Intel Arc A770 16GB · Q3_K_M · 4.7 t/s
- Intel Arc A770 8GB · Q2_K · 5.6 t/s
- Intel Arc A750 8GB · Q2_K · 5.6 t/s
- Intel Arc A580 8GB · Q2_K · 5.6 t/s
- Intel Arc A380 6GB · Q2_K · 2 t/s
- Intel Arc A310 4GB · Q2_K · 1.3 t/s
- Intel Arc Pro A60 12GB · Q3_K_M · 3.2 t/s
- Intel Arc Pro A50 6GB · Q2_K · 2.1 t/s
- Intel Arc Pro A40 6GB · Q2_K · 2.1 t/s
Notes
This is a reasoning model: it outputs a long chain of thought before answering.
Frequently asked questions
- What are the VRAM requirements for DeepSeek R1 Distill Llama 70B?
- DeepSeek R1 Distill Llama 70B requires approximately 47.1 GB of VRAM at Q4_K_M quantization, 81.4 GB at Q8_0, and 159.8 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length, so total VRAM rises with longer contexts.
- How many parameters does DeepSeek R1 Distill Llama 70B have?
- DeepSeek R1 Distill Llama 70B has 70 billion parameters.
- Is DeepSeek R1 Distill Llama 70B good at reasoning and math?
- Yes. With a MATH score of 94.5% and an MMLU-Pro score of 70.0%, DeepSeek R1 Distill Llama 70B handles complex multi-step reasoning, analytical tasks, and problem-solving well.
- Can DeepSeek R1 Distill Llama 70B run on a 16 GB GPU?
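- Not fully in VRAM. At Q4_K_M, DeepSeek R1 Distill Llama 70B needs 47.1 GB of VRAM, and even Q2_K needs 28.8 GB, both well above 16 GB. A 16 GB card can still run it with CPU offload at a few tokens per second. For full-VRAM inference you will need a 48 GB GPU like the RTX 6000 Ada or a dual-GPU setup.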
- Can DeepSeek R1 Distill Llama 70B run on a 24 GB GPU?
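- Not fully in VRAM. At Q4_K_M, DeepSeek R1 Distill Llama 70B needs 47.1 GB, and even Q2_K needs 28.8 GB. A 24 GB card can run it with CPU offload at reduced speed. For full-VRAM inference, consider a 48 GB card like the RTX 6000 Ada or a dual RTX 4090 setup.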
- What is the smallest quantization for DeepSeek R1 Distill Llama 70B that fits in 24 GB of VRAM?
- DeepSeek R1 Distill Llama 70B cannot fit entirely in 24 GB of VRAM at any standard quantization level. At 8k context, the minimum needed is 28.8 GB at Q2_K.
- What GPU do I need to run DeepSeek R1 Distill Llama 70B locally?
- You need a 48 GB GPU or a dual-GPU setup. At Q4_K_M, DeepSeek R1 Distill Llama 70B needs 47.1 GB of VRAM at 8k context, which leaves little headroom on a 48 GB card. Options: RTX 6000 Ada (48 GB), A6000 (48 GB), or 2× RTX 4090.
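The answers above all reduce to the same lookup: find the highest-precision quantization whose total fits in the available VRAM. A minimal sketch follows, using the 8k-context totals from the quantization table. The function name is hypothetical, NVFP4 is left out because it is CUDA-only, and the check ignores any extra headroom a real runtime may need.

```python
# Total VRAM in GB at 8k context, copied from the quantization table.
TOTAL_VRAM_GB = {
    "FP32": 316.6,
    "BF16": 159.8,
    "FP16": 159.8,
    "Q8_0": 81.4,
    "Q6_K": 67.3,
    "Q5_K_M": 53.5,
    "Q4_K_M": 47.1,
    "Q3_K_M": 36.7,
    "Q2_K": 28.8,
}


def largest_quant_that_fits(vram_gb: float):
    """Return the quantization with the largest total that fits, or None."""
    fitting = {q: total for q, total in TOTAL_VRAM_GB.items() if total <= vram_gb}
    if not fitting:
        return None  # nothing fits fully in VRAM; CPU offload would be needed
    return max(fitting, key=fitting.get)


for vram in (16, 24, 32, 48, 80):
    print(vram, "GB ->", largest_quant_that_fits(vram))
```

For 16 GB and 24 GB the function returns `None`, matching the answers above; 32 GB yields Q2_K and 48 GB yields Q4_K_M.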