DeepSeek R1 Distill Llama 8B
DeepSeek R1 Distill Llama 8B needs roughly 6.3 GB VRAM at Q4_K_M quantization (19.1 GB at FP16). 105 GPUs we track can run it fully in VRAM at 8k context.
105 GPUs run this natively · 2 with CPU offload
DeepSeek · 8B params · 125k context · MIT · Commercial use ok
DeepSeek R1 Distill Llama 8B is an 8B-parameter dense model developed by DeepSeek. It is a compact distillation that brings R1 reasoning to edge devices.
To run DeepSeek R1 Distill Llama 8B locally, use the recommended Q5_K_M quantization: about 7.0 GB total at 8k context (5.2 GB of weights), which fits on 8 GB GPUs. It is a strong reasoning model for budget hardware.
MMLU-Pro 41.0%, GPQA 49.0%, Math 89.1% — exceptional reasoning for the size.
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 32.0 GB | 1.07 GB | 37.0 GB |
| BF16 | 16.0 GB | 1.07 GB | 19.1 GB |
| FP16 | 16.0 GB | 1.07 GB | 19.1 GB |
| Q8_0 | 8.0 GB | 1.07 GB | 10.2 GB |
| Q6_K | 6.6 GB | 1.07 GB | 8.6 GB |
| Q5_K_M (recommended) | 5.2 GB | 1.07 GB | 7.0 GB |
| Q4_K_M | 4.5 GB | 1.07 GB | 6.3 GB |
| Q3_K_M | 3.4 GB | 1.07 GB | 5.1 GB |
| Q2_K | 2.6 GB | 1.07 GB | 4.2 GB |
| NVFP4 (CUDA only) | 4.0 GB | 1.07 GB | 5.7 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
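The linear growth of the KV cache can be sketched with a short calculation. This is a rough estimate, not the calculator's actual method: the layer count, KV head count, and head dimension below are assumptions taken from the Llama 3.1 8B base architecture, and the function names are hypothetical. Under those assumptions, an 8k context at FP16 comes out to about 1.07 GB, in line with the table.

```python
# Assumed architecture values (Llama 3.1 8B base); not stated on this page.
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # FP16 cache entries


def kv_cache_bytes(context_tokens: int) -> int:
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    # The leading 2 accounts for storing both a key and a value per head.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * context_tokens


def kv_cache_gb(context_tokens: int) -> float:
    """Same quantity in decimal gigabytes (1 GB = 1e9 bytes)."""
    return kv_cache_bytes(context_tokens) / 1e9


if __name__ == "__main__":
    for ctx in (8_192, 16_384, 32_768):
        print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.2f} GB")
```

Doubling the context doubles the cache, so the gap between quantizations stays fixed while the total grows with context length.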
Benchmarks
GPUs that run DeepSeek R1 Distill Llama 8B natively (105)
- NVIDIA RTX 5090 · BF16 · 112 t/s
- NVIDIA RTX 5080 · NVFP4 · 240 t/s
- NVIDIA RTX 5070 Ti · NVFP4 · 224 t/s
- NVIDIA RTX 5070 · NVFP4 · 168 t/s
- NVIDIA RTX 5060 Ti 16GB · NVFP4 · 112 t/s
- NVIDIA RTX 5060 · NVFP4 · 112 t/s
- NVIDIA RTX 5050 · NVFP4 · 80 t/s
- NVIDIA RTX 4090 · BF16 · 63 t/s
- NVIDIA RTX 4080 · NVFP4 · 179.3 t/s
- NVIDIA RTX 4070 Ti · NVFP4 · 126 t/s
- NVIDIA RTX 4070 · NVFP4 · 126 t/s
- NVIDIA RTX 4060 Ti 16GB · NVFP4 · 72 t/s
- NVIDIA RTX 4060 · NVFP4 · 68 t/s
- NVIDIA RTX 3090 · BF16 · 58.5 t/s
- NVIDIA RTX 3090 Ti · BF16 · 63 t/s
- NVIDIA RTX 3080 10GB · NVFP4 · 190 t/s
- NVIDIA RTX 3060 12GB · NVFP4 · 90 t/s
- NVIDIA H100 80GB · FP32 · 104.7 t/s
- NVIDIA A100 80GB · FP32 · 63.7 t/s
- NVIDIA A100 40GB · FP32 · 48.6 t/s
- NVIDIA L40S · FP32 · 27 t/s
- NVIDIA RTX A6000 · FP32 · 24 t/s
- NVIDIA RTX 4000 Ada · NVFP4 · 80 t/s
- NVIDIA RTX 4500 Ada · BF16 · 27 t/s
- NVIDIA RTX 5000 Ada · BF16 · 36 t/s
- NVIDIA RTX 6000 Ada · FP32 · 30 t/s
- NVIDIA RTX Pro 6000 · FP32 · 42 t/s
- NVIDIA DGX Spark (128GB) · FP32 · 8.5 t/s
- AMD Radeon RX 7900 XTX · BF16 · 60 t/s
- AMD Radeon RX 7900 XT · Q8_0 · 100 t/s
- AMD Radeon RX 7900 GRE · Q8_0 · 72 t/s
- AMD Radeon RX 6800 XT · Q8_0 · 64 t/s
- AMD Radeon PRO W7800 · BF16 · 36 t/s
- AMD Radeon PRO W7900 · FP32 · 27 t/s
- AMD Instinct MI300X · FP32 · 165.6 t/s
- AMD Radeon AI Pro 9700 32GB · BF16 · 40 t/s
- AMD Strix Halo (128GB) · FP32 · 8 t/s
- AMD Strix Halo (96GB) · FP32 · 8 t/s
- AMD Strix Halo (64GB) · FP32 · 8 t/s
- Apple M5 Max (128GB) · FP32 · 19.2 t/s
- Apple M5 Max (64GB) · FP32 · 19.2 t/s
- Apple M5 Max (48GB) · FP32 · 19.2 t/s
- Apple M5 Pro (48GB) · FP32 · 9.6 t/s
- Apple M5 Pro (36GB) · BF16 · 19.2 t/s
- Apple M5 Pro (24GB) · BF16 · 19.2 t/s
- Apple M5 (32GB) · BF16 · 9.6 t/s
- Apple M5 (16GB) · Q8_0 · 19.1 t/s
- Apple M4 Ultra (384GB) · FP32 · 34.1 t/s
- Apple M4 Ultra (192GB) · FP32 · 34.1 t/s
- Apple M4 Max (128GB) · FP32 · 17.1 t/s
- Apple M4 Max (96GB) · FP32 · 17.1 t/s
- Apple M4 Max (64GB) · FP32 · 17.1 t/s
- Apple M4 Max (48GB) · FP32 · 17.1 t/s
- Apple M4 Pro (48GB) · FP32 · 8.5 t/s
- Apple M4 Pro (24GB) · BF16 · 17.1 t/s
- Apple M4 (32GB) · BF16 · 7.5 t/s
- Apple M4 (16GB) · Q8_0 · 15 t/s
- Apple M3 Ultra (512GB) · FP32 · 25.6 t/s
- Apple M3 Ultra (256GB) · FP32 · 25.6 t/s
- Apple M3 Ultra (96GB) · FP32 · 25.6 t/s
- Apple M3 Max (128GB) · FP32 · 12.5 t/s
- Apple M3 Max (96GB) · FP32 · 12.5 t/s
- Apple M3 Max (64GB) · FP32 · 12.5 t/s
- Apple M3 Max (48GB) · FP32 · 12.5 t/s
- Apple M3 Max (36GB) · BF16 · 25 t/s
- Apple M3 Pro (36GB) · BF16 · 9.4 t/s
- Apple M3 Pro (18GB) · Q8_0 · 18.8 t/s
- Apple M3 (24GB) · BF16 · 6.3 t/s
- Apple M3 (16GB) · Q8_0 · 12.5 t/s
- Apple M3 (8GB) · Q3_K_M · 29.1 t/s
- Apple M2 Ultra (384GB) · FP32 · 25 t/s
- Apple M2 Ultra (192GB) · FP32 · 25 t/s
- Apple M2 Max (96GB) · FP32 · 12.5 t/s
- Apple M2 Max (64GB) · FP32 · 12.5 t/s
- Apple M2 Max (32GB) · BF16 · 25 t/s
- Apple M2 Pro (32GB) · BF16 · 12.5 t/s
- Apple M2 Pro (16GB) · Q8_0 · 25 t/s
- Apple M2 (24GB) · BF16 · 6.3 t/s
- Apple M2 (16GB) · Q8_0 · 12.5 t/s
- Apple M2 (8GB) · Q3_K_M · 29.1 t/s
- Apple M1 Ultra (128GB) · FP32 · 25 t/s
- Apple M1 Ultra (64GB) · FP32 · 25 t/s
- Apple M1 Max (64GB) · FP32 · 12.5 t/s
- Apple M1 Max (32GB) · BF16 · 25 t/s
- Apple M1 Pro (32GB) · BF16 · 12.5 t/s
- Apple M1 Pro (16GB) · Q8_0 · 25 t/s
- Apple M1 (16GB) · Q8_0 · 8.5 t/s
- Apple M1 (8GB) · Q3_K_M · 19.8 t/s
- Intel Arc B580 12GB · Q8_0 · 57 t/s
- Intel Arc B570 10GB · Q6_K · 57.9 t/s
- Intel Arc Pro B70 24GB · BF16 · 28.5 t/s
- Intel Arc Pro B60 24GB · BF16 · 23.8 t/s
- Intel Arc A770 16GB · Q8_0 · 70 t/s
- Intel Arc A770 8GB · Q5_K_M · 99.4 t/s
- Intel Arc A750 8GB · Q5_K_M · 99.4 t/s
- Intel Arc A580 8GB · Q5_K_M · 99.4 t/s
- Intel Arc A380 6GB · Q3_K_M · 54.1 t/s
- Intel Arc Pro A60 12GB · Q8_0 · 48 t/s
- Intel Arc Pro A50 6GB · Q3_K_M · 55.8 t/s
- Intel Arc Pro A40 6GB · Q3_K_M · 55.8 t/s
- Intel Data Center GPU Max 1550 · FP32 · 102.4 t/s
- Intel Data Center GPU Max 1100 · FP32 · 38.4 t/s
- Intel Arc 140V (32GB) · BF16 · 8.6 t/s
- Intel Arc 140V (16GB) · Q8_0 · 17.1 t/s
- Intel Arc 130V (16GB) · Q8_0 · 17.1 t/s
Plus 2 GPUs that run it with CPU offload (slower)
- Intel Arc A310 4GB · BF16 · 1.9 t/s
- CPU only (system RAM) · BF16 · 0.7 t/s
Frequently asked questions
- What are the VRAM requirements for DeepSeek R1 Distill Llama 8B?
- DeepSeek R1 Distill Llama 8B requires approximately 6.3 GB of VRAM at Q4_K_M quantization, 10.2 GB at Q8_0, and 19.1 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length, so total VRAM rises with longer contexts.
- How many parameters does DeepSeek R1 Distill Llama 8B have?
- DeepSeek R1 Distill Llama 8B has 8 billion parameters.
- Is DeepSeek R1 Distill Llama 8B good at reasoning and math?
- Yes. With a MATH score of 89.1 and MMLU-Pro of 41, DeepSeek R1 Distill Llama 8B handles complex multi-step reasoning, analytical tasks, and problem-solving well.
- Can DeepSeek R1 Distill Llama 8B run on a 16 GB GPU?
- Yes. DeepSeek R1 Distill Llama 8B needs 6.3 GB at Q4_K_M, which fits in a 16 GB GPU like the RTX 4080 or RTX 4070 Ti Super.
- What is the highest-quality quantization for DeepSeek R1 Distill Llama 8B that fits in 24 GB of VRAM?
- BF16. At 19.1 GB, it is the highest-quality format that fits in 24 GB of VRAM; FP32 needs 37.0 GB.
- What GPU do I need to run DeepSeek R1 Distill Llama 8B locally?
- A 16 GB GPU is enough. At Q4_K_M, DeepSeek R1 Distill Llama 8B needs 6.3 GB VRAM. Good options: RTX 4080 (16 GB), RTX 4070 Ti Super (16 GB).
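The "what fits in my GPU" questions above can be answered mechanically from the totals in the VRAM table. The following is a minimal sketch, assuming an 8k context and using the page's own totals; the names `TOTAL_VRAM_GB` and `best_quant_for` are hypothetical, and NVFP4 is left out because it requires a CUDA GPU.

```python
from typing import Optional

# Total VRAM in GB at 8k context, copied from the table on this page,
# ordered from highest to lowest precision.
TOTAL_VRAM_GB = [
    ("FP32", 37.0),
    ("BF16", 19.1),
    ("FP16", 19.1),
    ("Q8_0", 10.2),
    ("Q6_K", 8.6),
    ("Q5_K_M", 7.0),
    ("Q4_K_M", 6.3),
    ("Q3_K_M", 5.1),
    ("Q2_K", 4.2),
]


def best_quant_for(vram_gb: float) -> Optional[str]:
    """Return the highest-precision format whose total fits in `vram_gb`.

    Returns None when even the smallest listed format does not fit,
    in which case CPU offload would be needed.
    """
    for name, total in TOTAL_VRAM_GB:
        if total <= vram_gb:
            return name
    return None


if __name__ == "__main__":
    for budget in (24, 16, 8, 4):
        print(f"{budget} GB -> {best_quant_for(budget)}")
```

This treats the full VRAM figure as usable; in practice the display, other applications, or shared system memory may claim some of it, so leaving headroom is sensible.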