What are the VRAM requirements for DeepSeek R1 Distill Llama 8B?

DeepSeek R1 Distill Llama 8B requires approximately 6.7 GB of VRAM at Q4_K_M quantization, 10.7 GB at Q8_0, and 19.1 GB at FP16. These numbers assume an 8k context window; the KV cache grows linearly with context length, so total VRAM rises at longer contexts.

How many parameters does DeepSeek R1 Distill Llama 8B have?

DeepSeek R1 Distill Llama 8B has 8 billion parameters.

Is DeepSeek R1 Distill Llama 8B good at reasoning and math?

Yes. With a MATH score of 89.1 and MMLU-Pro of 41, DeepSeek R1 Distill Llama 8B handles complex multi-step reasoning, analytical tasks, and problem-solving well.

Can DeepSeek R1 Distill Llama 8B run on a 16 GB GPU?

Yes. DeepSeek R1 Distill Llama 8B needs 6.7 GB at Q4_K_M, which fits in a 16 GB GPU like the RTX 4080 or RTX 4070 Ti Super.

What is the highest-precision format for DeepSeek R1 Distill Llama 8B that fits in 24 GB of VRAM?

BF16. At BF16 (or FP16), DeepSeek R1 Distill Llama 8B needs 19.1 GB, which fits in 24 GB of VRAM; FP32 needs 37.0 GB and does not.

What GPU do I need to run DeepSeek R1 Distill Llama 8B locally?

An 8 GB GPU is enough: at Q4_K_M, DeepSeek R1 Distill Llama 8B needs 6.7 GB of VRAM. A 16 GB card such as the RTX 4080 or RTX 4070 Ti Super also fits Q8_0 (10.7 GB).

DeepSeek R1 Distill Llama 8B

Name: DeepSeek R1 Distill Llama 8B
Author: DeepSeek

DeepSeek R1 Distill Llama 8B needs roughly 6.7 GB VRAM at Q4_K_M quantization (19.1 GB at FP16). 102 GPUs we track can run it fully in VRAM at 8k context.

102 GPUs run this natively · 2 with CPU offload

DeepSeek · 8B params · 125k context · MIT · Commercial use OK

DeepSeek R1 Distill Llama 8B is an 8B-parameter dense model developed by DeepSeek. It is a compact distillation that brings R1 reasoning to edge devices.

To run DeepSeek R1 Distill Llama 8B locally, use Q5_K_M: about 5.7 GB of weights and 7.6 GB total at 8k context, which fits on 8 GB GPUs. A strong reasoning model for budget hardware.

MMLU-Pro 41.0%, GPQA 49.0%, Math 89.1% — exceptional reasoning for the size.

VRAM at each quantization

Numbers here are computed at 8k context. Because KV cache grows linearly with context length, expect higher totals at longer sequence lengths.
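The linear growth can be checked with a short calculation. The sketch below assumes the architecture of the Llama 3.1 8B base model (32 layers, 8 key/value heads, head dimension 128) and 2 bytes per value for an FP16 cache; these figures are not stated on this page, and the function names are illustrative. Under those assumptions, 8192 tokens comes to about 1.07 GB, in line with the KV cache figure listed for this model.

```python
# Assumed architecture of the Llama 3.1 8B base (not stated on this page).
N_LAYERS = 32
N_KV_HEADS = 8       # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_FP16 = 2


def kv_cache_bytes(context_tokens: int, bytes_per_value: int = BYTES_FP16) -> int:
    """Bytes needed to cache keys and values for `context_tokens` tokens."""
    # Factor 2 covers one key vector and one value vector per head per layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens


def kv_cache_gb(context_tokens: int) -> float:
    """Same quantity in decimal gigabytes (1 GB = 1e9 bytes)."""
    return kv_cache_bytes(context_tokens) / 1e9


if __name__ == "__main__":
    for ctx in (8192, 16384, 32768):
        print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.2f} GB")
```

Doubling the context doubles the cache, while the weights stay fixed, so the share of VRAM taken by the KV cache grows quickly at long contexts on the smaller quantizations.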

Quant	Weights	KV cache	Total
FP32	32.0 GB	1.07 GB	37.0 GB
BF16	16.0 GB	1.07 GB	19.1 GB
FP16	16.0 GB	1.07 GB	19.1 GB
Q8_0	8.5 GB	1.07 GB	10.7 GB
Q6_K	6.6 GB	1.07 GB	8.6 GB
Q5_K_M (recommended)	5.7 GB	1.07 GB	7.6 GB
Q4_K_M	4.9 GB	1.07 GB	6.7 GB
Q3_K_M	3.9 GB	1.07 GB	5.5 GB
Q2_K	3.0 GB	1.07 GB	4.6 GB
NVFP4 (CUDA only)	4.0 GB	1.07 GB	5.7 GB

Shown at 8k context with FP16 KV cache. NVFP4 needs a CUDA GPU to run.
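The Total column is larger than Weights plus KV cache, so it appears to include some runtime overhead. The page does not say how that overhead is computed; the sketch below assumes a flat 12% multiplier, a value inferred from the table rather than a documented constant. With that assumption, every row is reproduced to within 0.1 GB, and the same function gives a rough estimate for other context lengths when passed a different KV cache size.

```python
# Weights per quantization in GB, copied from the table on this page.
WEIGHTS_GB = {
    "FP32": 32.0, "BF16": 16.0, "FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.6,
    "Q5_K_M": 5.7, "Q4_K_M": 4.9, "Q3_K_M": 3.9, "Q2_K": 3.0, "NVFP4": 4.0,
}
KV_CACHE_GB_8K = 1.07   # FP16 KV cache at 8k context, from the table
OVERHEAD = 1.12         # ASSUMPTION: inferred from the table, not documented


def estimate_total_gb(quant: str, kv_cache_gb: float = KV_CACHE_GB_8K) -> float:
    """Rough total VRAM: (weights + KV cache) scaled by the assumed overhead."""
    return (WEIGHTS_GB[quant] + kv_cache_gb) * OVERHEAD


if __name__ == "__main__":
    for quant in WEIGHTS_GB:
        print(f"{quant:<7} {estimate_total_gb(quant):5.1f} GB at 8k context")
    # 16k context: the KV cache doubles, the weights do not change.
    print(f"Q4_K_M  {estimate_total_gb('Q4_K_M', 2 * KV_CACHE_GB_8K):5.1f} GB at 16k context")
```

Treat the result as an estimate only: actual usage depends on the runtime, batch size, and any KV cache compression.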

Benchmarks

MMLU-Pro: 41.0

GPQA Diamond: 49.0

MATH: 89.1

GPUs that run DeepSeek R1 Distill Llama 8B natively (102)

GPU	Quant	Speed
NVIDIA RTX 5090	BF16	68.2 t/s
NVIDIA RTX 5080	NVFP4	123 t/s
NVIDIA RTX 5070 Ti	NVFP4	114.8 t/s
NVIDIA RTX 5070	NVFP4	86.1 t/s
NVIDIA RTX 5060 Ti 16GB	NVFP4	57.4 t/s
NVIDIA RTX 5060	NVFP4	57.4 t/s
NVIDIA RTX 5050	NVFP4	41 t/s
NVIDIA RTX 4090	BF16	38.4 t/s
NVIDIA RTX 4080	NVFP4	91.9 t/s
NVIDIA RTX 4070 Ti	NVFP4	64.6 t/s
NVIDIA RTX 4070	NVFP4	64.6 t/s
NVIDIA RTX 4060 Ti 16GB	NVFP4	36.9 t/s
NVIDIA RTX 4060	NVFP4	34.8 t/s
NVIDIA RTX 3090	BF16	35.6 t/s
NVIDIA RTX 3090 Ti	BF16	38.4 t/s
NVIDIA RTX 3080 10GB	NVFP4	97.4 t/s
NVIDIA RTX 3060 12GB	NVFP4	46.1 t/s
NVIDIA H100 80GB	FP32	65.8 t/s
NVIDIA A100 80GB	FP32	40.1 t/s
NVIDIA A100 40GB	FP32	30.6 t/s
NVIDIA L40S	FP32	17 t/s
NVIDIA RTX A6000	FP32	15.1 t/s
NVIDIA RTX 4000 Ada	NVFP4	41 t/s
NVIDIA RTX 4500 Ada	BF16	16.4 t/s
NVIDIA RTX 5000 Ada	BF16	21.9 t/s
NVIDIA RTX 6000 Ada	FP32	18.9 t/s
NVIDIA RTX Pro 6000	FP32	26.4 t/s
NVIDIA DGX Spark (128GB)	FP32	5.4 t/s
AMD Radeon RX 7900 XTX	BF16	36.5 t/s
AMD Radeon RX 7900 XT	Q8_0	54.3 t/s
AMD Radeon RX 7900 GRE	Q8_0	39.1 t/s
AMD Radeon RX 6800 XT	Q8_0	34.7 t/s
AMD Radeon PRO W7800	BF16	21.9 t/s
AMD Radeon PRO W7900	FP32	17 t/s
AMD Instinct MI300X	FP32	104.2 t/s
AMD Radeon AI Pro 9700 32GB	BF16	24.4 t/s
AMD Strix Halo (128GB)	FP32	5 t/s
AMD Strix Halo (96GB)	FP32	5 t/s
AMD Strix Halo (64GB)	FP32	5 t/s
Apple M5 Max (128GB)	FP32	14.9 t/s
Apple M5 Max (64GB)	FP32	14.9 t/s
Apple M5 Max (48GB)	FP32	14.9 t/s
Apple M5 Pro (48GB)	FP32	7.4 t/s
Apple M5 Pro (36GB)	BF16	14.4 t/s
Apple M5 Pro (24GB)	Q8_0	25.6 t/s
Apple M5 (32GB)	BF16	7.2 t/s
Apple M5 (16GB)	Q5_K_M	18.1 t/s
Apple M4 Ultra (384GB)	FP32	26.4 t/s
Apple M4 Ultra (192GB)	FP32	26.4 t/s
Apple M4 Max (128GB)	FP32	13.2 t/s
Apple M4 Max (96GB)	FP32	13.2 t/s
Apple M4 Max (64GB)	FP32	13.2 t/s
Apple M4 Max (48GB)	FP32	13.2 t/s
Apple M4 Pro (48GB)	FP32	6.6 t/s
Apple M4 Pro (24GB)	Q8_0	22.8 t/s
Apple M4 (32GB)	BF16	5.6 t/s
Apple M4 (16GB)	Q5_K_M	14.2 t/s
Apple M3 Ultra (512GB)	FP32	19.8 t/s
Apple M3 Ultra (256GB)	FP32	19.8 t/s
Apple M3 Ultra (96GB)	FP32	19.8 t/s
Apple M3 Max (128GB)	FP32	9.7 t/s
Apple M3 Max (96GB)	FP32	9.7 t/s
Apple M3 Max (64GB)	FP32	9.7 t/s
Apple M3 Max (48GB)	FP32	9.7 t/s
Apple M3 Max (36GB)	BF16	18.7 t/s
Apple M3 Pro (36GB)	BF16	7 t/s
Apple M3 Pro (18GB)	Q6_K	15.7 t/s
Apple M3 (24GB)	Q8_0	8.4 t/s
Apple M3 (16GB)	Q5_K_M	11.8 t/s
Apple M2 Ultra (384GB)	FP32	19.4 t/s
Apple M2 Ultra (192GB)	FP32	19.4 t/s
Apple M2 Max (96GB)	FP32	9.7 t/s
Apple M2 Max (64GB)	FP32	9.7 t/s
Apple M2 Max (32GB)	BF16	18.7 t/s
Apple M2 Pro (32GB)	BF16	9.4 t/s
Apple M2 Pro (16GB)	Q5_K_M	23.6 t/s
Apple M2 (24GB)	Q8_0	8.4 t/s
Apple M2 (16GB)	Q5_K_M	11.8 t/s
Apple M1 Ultra (128GB)	FP32	19.4 t/s
Apple M1 Ultra (64GB)	FP32	19.4 t/s
Apple M1 Max (64GB)	FP32	9.7 t/s
Apple M1 Max (32GB)	BF16	18.7 t/s
Apple M1 Pro (32GB)	BF16	9.4 t/s
Apple M1 Pro (16GB)	Q5_K_M	23.6 t/s
Apple M1 (16GB)	Q5_K_M	8 t/s
Intel Arc B580 12GB	Q8_0	30.9 t/s
Intel Arc B570 10GB	Q6_K	32.3 t/s
Intel Arc Pro B70 24GB	BF16	17.4 t/s
Intel Arc Pro B60 24GB	BF16	14.5 t/s
Intel Arc A770 16GB	Q8_0	38 t/s
Intel Arc A770 8GB	Q5_K_M	49.2 t/s
Intel Arc A750 8GB	Q5_K_M	49.2 t/s
Intel Arc A580 8GB	Q5_K_M	49.2 t/s
Intel Arc A380 6GB	Q3_K_M	24.6 t/s
Intel Arc Pro A60 12GB	Q8_0	26.1 t/s
Intel Arc Pro A50 6GB	Q3_K_M	25.4 t/s
Intel Arc Pro A40 6GB	Q3_K_M	25.4 t/s
Intel Data Center GPU Max 1550	FP32	64.4 t/s
Intel Data Center GPU Max 1100	FP32	24.2 t/s
Intel Arc 140V (32GB)	BF16	5.2 t/s
Intel Arc 140V (16GB)	Q5_K_M	13.2 t/s
Intel Arc 130V (16GB)	Q5_K_M	13.2 t/s

Plus 2 GPUs that run it with CPU offload (slower)

GPU	Quant	Speed
Intel Arc A310 4GB	BF16	1.7 t/s
CPU only (system RAM)	BF16	2.3 t/s

Available on Hugging Face and Ollama. Released 2025-01-20.

How to run DeepSeek R1 Distill Llama 8B locally

Q5_K_M needs 7.6 GB at 8k context, so it fits on an 8 GB GPU and leaves ample headroom on a 24 GB card.
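Choosing a quantization for a given card comes down to finding the highest-precision row of the VRAM table whose total fits. A minimal sketch, using the 8k-context totals from this page; the function name `best_quant_for` is hypothetical, and NVFP4 is left out because it only runs on CUDA GPUs.

```python
from typing import Optional

# Total VRAM at 8k context in GB, from the table on this page,
# ordered from highest to lowest precision (FP16 matches BF16).
TOTAL_GB_8K = [
    ("FP32", 37.0), ("BF16", 19.1), ("Q8_0", 10.7), ("Q6_K", 8.6),
    ("Q5_K_M", 7.6), ("Q4_K_M", 6.7), ("Q3_K_M", 5.5), ("Q2_K", 4.6),
]


def best_quant_for(vram_gb: float) -> Optional[str]:
    """Return the highest-precision quantization whose total fits, or None."""
    for name, total_gb in TOTAL_GB_8K:
        if total_gb <= vram_gb:
            return name
    return None  # nothing fits fully in VRAM; CPU offload would be needed


if __name__ == "__main__":
    for vram in (6, 8, 12, 16, 24):
        print(f"{vram:>2} GB -> {best_quant_for(vram)}")
```

This treats all VRAM as usable. Real systems reserve some memory for the display, driver, and other applications, so leave headroom when the fit is tight.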

Ollama

ollama run deepseek-r1:8b
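Once the model has been pulled, the local Ollama server can also be called from a script. The sketch below assumes Ollama is listening on its default port 11434 and uses its `/api/generate` endpoint with streaming disabled. The helper names `build_payload` and `ask` are illustrative, and `num_ctx` is set to 8192 only to match the 8k context used for the VRAM figures on this page.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, num_ctx: int = 8192) -> dict:
    """Request body for a single, non-streaming generation."""
    return {
        "model": "deepseek-r1:8b",
        "prompt": prompt,
        "stream": False,                  # return one JSON object, not chunks
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }


def ask(prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `ask("What is 17 * 23?")` needs the Ollama server to be running with the model already downloaded; a larger `num_ctx` increases KV cache memory accordingly.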

llama.cpp

./llama-cli -m deepseek-r1-distill-llama-8b.Q5_K_M.gguf -c 8192 -ngl 99

LM Studio

Search for 'DeepSeek R1 Distill Llama 8B' in LM Studio. The Q5_K_M variant needs 7.6 GB at 8k context, so it runs on an 8 GB GPU, which makes this one of the most accessible reasoning models available.

Why this quantization? At 8B dense parameters, Q5_K_M needs only about 5.7 GB for weights (7.6 GB total at 8k context), which fits on budget 8 GB GPUs. Despite its small size, this distill achieves a MATH score of 89.1 and GPQA of 49.0, better than many models five times its size. Q5 preserves these capabilities better than Q4, and the extra 0.9 GB over Q4_K_M is negligible on modern hardware.

Who is DeepSeek R1 Distill Llama 8B for?

Students, hobbyists, and developers with budget GPUs (8 GB VRAM) who want reasoning capabilities on accessible hardware. This is the easiest way to experience chain-of-thought reasoning locally. The MIT license and tiny footprint make it ideal for embedded applications or running alongside other workloads.

Best for

Math homework help and STEM tutoring at remarkable quality for the size
Affordable chain-of-thought reasoning on consumer hardware
Learning about reasoning models without expensive GPU requirements
Lightweight reasoning assistant that runs alongside your IDE or browser

Not ideal for

General knowledge queries (MMLU-Pro: 41.0 is limited for the model's size)
Complex instruction following where larger models are needed
Professional applications where accuracy needs to be consistently high
