What are the VRAM requirements for Command-R 35B?

Command-R 35B requires approximately 35.9 GB of VRAM at Q4_K_M quantization, 53.7 GB at Q8, and 90.4 GB at FP16. These numbers assume 8k context window; VRAM scales linearly with context length due to the KV cache.

How many parameters does Command-R 35B have?

Command-R 35B has 35 billion parameters.

How capable is Command-R 35B?

Command-R 35B has an MMLU-Pro score of 33, making it well-suited for lightweight tasks, prototyping, and resource-constrained environments.

Can Command-R 35B run on a 16 GB GPU?

No. At Q4_K_M, Command-R 35B needs 35.9 GB of VRAM — more than 16 GB. You will need a 48 GB GPU like the RTX 6000 Ada or a dual-GPU setup.

Can Command-R 35B run on a 24 GB GPU?

No. Even at Q4_K_M, Command-R 35B needs 35.9 GB. Consider a 48 GB card like the RTX 6000 Ada or a dual RTX 4090 setup.

What is the smallest quantization for Command-R 35B that fits in 24 GB of VRAM?

Command-R 35B cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 27.0 GB at Q2_K.

What GPU do I need to run Command-R 35B locally?

You need a 48 GB GPU or a dual-GPU setup. At Q4_K_M, Command-R 35B needs 35.9 GB VRAM. Options: RTX 6000 Ada (48 GB), A6000 (48 GB), or 2× RTX 4090.

Command-R 35B

Name: Command-R 35B
Author: Cohere

Command-R 35B needs roughly 35.9 GB VRAM at Q4_K_M quantization (90.4 GB at FP16). 47 GPUs we track can run it fully in VRAM at 8k context.

47 GPUs run this natively · 36 with CPU offload

Cohere35B params125k contextCC-BY-NC 4.0Non-commercial only

Command-R 35B is a 35B parameter dense model developed by Cohere. August 2024 RAG and tool-use specialist from Cohere. 128K context with full attention (no GQA).

To run Command-R 35B locally: Q4_K_M ~20-22GB — fits on 24GB GPU but full attention means heavy KV cache at long context. 32GB+ recommended for RAG workloads.

Industry-leading retrieval-augmented generation. Strong multilingual support across 10 languages.

VRAM at each quantization

Calculated at 8k context. Since KV cache scales linearly with context, longer sessions need more VRAM than shown here.

Quant	Weights	KV cache	Total
FP32	140.0 GB	10.74 GB	168.8 GB
BF16	70.0 GB	10.74 GB	90.4 GB
FP16	70.0 GB	10.74 GB	90.4 GB
Q8_0	37.2 GB	10.74 GB	53.7 GB
Q6_K	28.7 GB	10.74 GB	44.2 GB
Q5_K_M	24.9 GB	10.74 GB	39.9 GB
Q4_K_Mrec	21.3 GB	10.74 GB	35.9 GB
Q3_K_M	16.8 GB	10.74 GB	30.9 GB
Q2_K	13.3 GB	10.74 GB	27.0 GB
NVFP4cuda	17.5 GB	10.74 GB	31.6 GB

KV cache figures assume 8k context at FP16. NVFP4 quantization requires a CUDA-capable GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.

Benchmarks

MMLU-Pro

IFEval

Arena ELO

1150

GPUs that run Command-R 35B natively (47)

NVIDIA RTX 5090Q2_K · 48.4 t/s
NVIDIA H100 80GBNVFP4 · 77.1 t/s
NVIDIA A100 80GBNVFP4 · 46.9 t/s
NVIDIA A100 40GBNVFP4 · 35.8 t/s
NVIDIA L40SNVFP4 · 19.9 t/s
NVIDIA RTX A6000NVFP4 · 17.7 t/s
NVIDIA RTX 5000 AdaQ2_K · 15.6 t/s
NVIDIA RTX 6000 AdaNVFP4 · 22.1 t/s
NVIDIA RTX Pro 6000BF16 · 10.8 t/s
NVIDIA DGX Spark (128GB)BF16 · 2.2 t/s
AMD Radeon PRO W7800Q2_K · 15.6 t/s
AMD Radeon PRO W7900Q6_K · 14.2 t/s
AMD Instinct MI300XFP32 · 22.9 t/s
AMD Radeon AI Pro 9700 32GBQ2_K · 17.3 t/s
AMD Strix Halo (128GB)BF16 · 2.1 t/s
AMD Strix Halo (96GB)Q8_0 · 3.5 t/s
AMD Strix Halo (64GB)Q8_0 · 3.5 t/s
Apple M5 Max (128GB)BF16 · 6.1 t/s
Apple M5 Max (64GB)Q8_0 · 10.2 t/s
Apple M5 Max (48GB)Q5_K_M · 13.8 t/s
Apple M5 Pro (48GB)Q5_K_M · 6.9 t/s
Apple M5 Pro (36GB)Q2_K · 10.2 t/s
Apple M4 Ultra (384GB)FP32 · 5.8 t/s
Apple M4 Ultra (192GB)FP32 · 5.8 t/s
Apple M4 Max (128GB)BF16 · 5.4 t/s
Apple M4 Max (96GB)Q8_0 · 9.1 t/s
Apple M4 Max (64GB)Q8_0 · 9.1 t/s
Apple M4 Max (48GB)Q5_K_M · 12.2 t/s
Apple M4 Pro (48GB)Q5_K_M · 6.1 t/s
Apple M3 Ultra (512GB)FP32 · 4.3 t/s
Apple M3 Ultra (256GB)FP32 · 4.3 t/s
Apple M3 Ultra (96GB)Q8_0 · 13.7 t/s
Apple M3 Max (128GB)BF16 · 4 t/s
Apple M3 Max (96GB)Q8_0 · 6.7 t/s
Apple M3 Max (64GB)Q8_0 · 6.7 t/s
Apple M3 Max (48GB)Q5_K_M · 9 t/s
Apple M3 Max (36GB)Q2_K · 13.3 t/s
Apple M3 Pro (36GB)Q2_K · 5 t/s
Apple M2 Ultra (384GB)FP32 · 4.2 t/s
Apple M2 Ultra (192GB)FP32 · 4.2 t/s
Apple M2 Max (96GB)Q8_0 · 6.7 t/s
Apple M2 Max (64GB)Q8_0 · 6.7 t/s
Apple M1 Ultra (128GB)BF16 · 7.9 t/s
Apple M1 Ultra (64GB)Q8_0 · 13.3 t/s
Apple M1 Max (64GB)Q8_0 · 6.7 t/s
Intel Data Center GPU Max 1550BF16 · 26.4 t/s
Intel Data Center GPU Max 1100Q6_K · 20.2 t/s

Plus 36 GPUs that run it with CPU offload (slower)

NVIDIA RTX 5080NVFP4 · 1.8 t/s
NVIDIA RTX 5070 TiNVFP4 · 1.8 t/s
NVIDIA RTX 5070NVFP4 · 1.4 t/s
NVIDIA RTX 5060 Ti 16GBNVFP4 · 1.7 t/s
NVIDIA RTX 5060NVFP4 · 1.2 t/s
NVIDIA RTX 5050NVFP4 · 1.2 t/s
NVIDIA RTX 4090NVFP4 · 4 t/s
NVIDIA RTX 4080NVFP4 · 1.8 t/s
NVIDIA RTX 4070 TiNVFP4 · 1.4 t/s
NVIDIA RTX 4070NVFP4 · 1.4 t/s
NVIDIA RTX 4060 Ti 16GBNVFP4 · 1.7 t/s
NVIDIA RTX 4060NVFP4 · 1.2 t/s
NVIDIA RTX 3090NVFP4 · 3.9 t/s
NVIDIA RTX 3090 TiNVFP4 · 4 t/s
NVIDIA RTX 3080 10GBNVFP4 · 1.3 t/s
NVIDIA RTX 3060 12GBNVFP4 · 1.4 t/s
NVIDIA RTX 4000 AdaNVFP4 · 2.2 t/s
NVIDIA RTX 4500 AdaNVFP4 · 3.3 t/s
AMD Radeon RX 7900 XTXQ6_K · 1.4 t/s
AMD Radeon RX 7900 XTQ6_K · 1.2 t/s
AMD Radeon RX 7900 GREQ5_K_M · 1.2 t/s
AMD Radeon RX 6800 XTQ5_K_M · 1.2 t/s
Intel Arc B580 12GBQ4_K_M · 1.2 t/s
Intel Arc B570 10GBQ3_K_M · 1.3 t/s
Intel Arc Pro B70 24GBQ6_K · 1.3 t/s
Intel Arc Pro B60 24GBQ6_K · 1.3 t/s
Intel Arc A770 16GBQ5_K_M · 1.2 t/s
Intel Arc A770 8GBQ3_K_M · 1.2 t/s
Intel Arc A750 8GBQ3_K_M · 1.2 t/s
Intel Arc A580 8GBQ3_K_M · 1.2 t/s
Intel Arc A380 6GBQ3_K_M · 1.1 t/s
Intel Arc A310 4GBQ2_K · 1.2 t/s
Intel Arc Pro A60 12GBQ4_K_M · 1.1 t/s
Intel Arc Pro A50 6GBQ3_K_M · 1.1 t/s
Intel Arc Pro A40 6GBQ3_K_M · 1.1 t/s
CPU only (system RAM)Q2_K · 1.7 t/s

Notes

Full attention (no GQA) — heavy KV cache at long context.

Hugging Face ↗Ollama ↗Released 2024-08-30

Compare Command-R 35B with other models

Command-R 35BvsQwen3 32B32.8B params

How to run Command-R 35B locally

816244880160320

Q4_K_M needs 35.9 GB — needs a workstation or datacenter GPU (48–80 GB).

Ollama

ollama run command-r:35b

llama.cpp

./llama-cli -m c4ai-command-r-35b.Q4_K_M.gguf -c 8192 -ngl 99

LM Studio: Search for 'Command R 35B' in LM Studio. Note the heavy KV cache at long context due to full attention -- keep context under 32K for best performance.

Why this quantization? Command R has 35B dense parameters with full attention (64 KV heads, no GQA), which makes its KV cache substantially larger than other models of similar size. Q4_K_M is essential for keeping the weight footprint manageable (roughly 20 GB), leaving VRAM budget for the unusually heavy KV cache. Users should be aware that the full-attention design limits practical context length compared to GQA models.

Who is Command-R 35B for?

Developers building RAG (Retrieval Augmented Generation) pipelines who want a model specifically designed for grounded, citation-backed responses. Also strong for multilingual applications. Note the CC-BY-NC license restricts commercial use.

Best for

RAG pipelines with citation generation and source attribution
Multilingual document processing and question answering
Enterprise search and knowledge base applications (non-commercial)
Grounded responses that cite their sources accurately

Not ideal for

Commercial deployments -- the CC-BY-NC 4.0 license prohibits commercial use
Long-context tasks -- the full-attention architecture makes the KV cache very heavy beyond 32K tokens
General chat or creative writing -- other 32B-class models outperform on general benchmarks

Continue reading

vram-guides9 min

Best LLMs for 48 GB VRAM (2026)

Frequently asked questions

What are the VRAM requirements for Command-R 35B?: Command-R 35B requires approximately 35.9 GB of VRAM at Q4_K_M quantization, 53.7 GB at Q8, and 90.4 GB at FP16. These numbers assume 8k context window; VRAM scales linearly with context length due to the KV cache.
How many parameters does Command-R 35B have?: Command-R 35B has 35 billion parameters.
How capable is Command-R 35B?: Command-R 35B has an MMLU-Pro score of 33, making it well-suited for lightweight tasks, prototyping, and resource-constrained environments.
Can Command-R 35B run on a 16 GB GPU?: No. At Q4_K_M, Command-R 35B needs 35.9 GB of VRAM — more than 16 GB. You will need a 48 GB GPU like the RTX 6000 Ada or a dual-GPU setup.
Can Command-R 35B run on a 24 GB GPU?: No. Even at Q4_K_M, Command-R 35B needs 35.9 GB. Consider a 48 GB card like the RTX 6000 Ada or a dual RTX 4090 setup.
What is the smallest quantization for Command-R 35B that fits in 24 GB of VRAM?: Command-R 35B cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 27.0 GB at Q2_K.
What GPU do I need to run Command-R 35B locally?: You need a 48 GB GPU or a dual-GPU setup. At Q4_K_M, Command-R 35B needs 35.9 GB VRAM. Options: RTX 6000 Ada (48 GB), A6000 (48 GB), or 2× RTX 4090.