Command-R 35B
Command-R 35B needs roughly 17.5 GB of VRAM for its weights at Q4 quantization (70.0 GB at FP16), before KV cache and runtime overhead. 33 GPUs we track can run it fully in VRAM at 8k context.
Cohere · 35B params · 125k context · CC-BY-NC 4.0 · Non-commercial only
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP16 | 70.0 GB | 10.74 GB | 90.4 GB |
| Q8 | 35.0 GB | 10.74 GB | 51.2 GB |
| Q6_K | 26.3 GB | 10.74 GB | 41.4 GB |
| Q5_K_M | 21.9 GB | 10.74 GB | 36.5 GB |
| Q4_K_M | 17.5 GB | 10.74 GB | 31.6 GB |
| Q3_K_M | 14.0 GB | 10.74 GB | 27.7 GB |
| Q2_K | 10.5 GB | 10.74 GB | 23.8 GB |
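The figures in this table can be approximately reproduced with a little arithmetic. The sketch below is a reconstruction under stated assumptions, not the calculator this page actually uses: the bits-per-weight values are inferred from the Weights column, the layer count (40) and hidden size (8192) are assumed from the published Command-R v01 configuration, an FP16 KV cache is assumed, and the 12% overhead factor is inferred from the gap between the Total column and Weights + KV cache.

```python
# Rough VRAM estimate for Command-R 35B.
# All constants below are assumptions chosen to match the table, not official values.

PARAMS = 35e9  # parameter count

# Effective bits per weight, inferred from the Weights column.
BITS_PER_WEIGHT = {
    "FP16": 16.0, "Q8": 8.0, "Q6_K": 6.0, "Q5_K_M": 5.0,
    "Q4_K_M": 4.0, "Q3_K_M": 3.2, "Q2_K": 2.4,
}

N_LAYERS = 40        # assumed transformer layer count
HIDDEN_SIZE = 8192   # assumed model width (full attention: K and V are each this wide)
KV_BYTES = 2         # assumed FP16 KV cache


def weights_gb(quant: str) -> float:
    """Weight memory in decimal GB for a given quantization."""
    return PARAMS * BITS_PER_WEIGHT[quant] / 8 / 1e9


def kv_cache_gb(context_tokens: int) -> float:
    """KV cache in decimal GB; linear in context length."""
    # Factor 2 covers the separate K and V tensors stored per layer.
    return 2 * N_LAYERS * HIDDEN_SIZE * KV_BYTES * context_tokens / 1e9


def total_gb(quant: str, context_tokens: int = 8192, overhead: float = 0.12) -> float:
    """Weights plus KV cache, scaled by an assumed runtime overhead factor."""
    return (weights_gb(quant) + kv_cache_gb(context_tokens)) * (1 + overhead)


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant:7s} {weights_gb(quant):6.1f} GB weights, "
              f"{total_gb(quant):6.1f} GB total at 8k context")
```

Because the KV cache term is linear in `context_tokens`, doubling the context to 16k doubles the cache, while the weight term stays fixed.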
Benchmarks
GPUs that run Command-R 35B natively (33)
- NVIDIA RTX 5090 · Q3_K_M · 128 t/s
- NVIDIA H100 80GB · Q8 · 95.7 t/s
- NVIDIA A100 80GB · Q8 · 58.3 t/s
- NVIDIA A100 40GB · Q5_K_M · 71.1 t/s
- NVIDIA L40S · Q6_K · 32.9 t/s
- NVIDIA RTX A6000 · Q6_K · 29.3 t/s
- NVIDIA RTX 6000 Ada · Q6_K · 36.6 t/s
- AMD Instinct MI300X · FP16 · 75.7 t/s
- Apple M4 Ultra (384GB) · FP16 · 15.6 t/s
- Apple M4 Ultra (192GB) · FP16 · 15.6 t/s
- Apple M4 Max (128GB) · FP16 · 7.8 t/s
- Apple M4 Max (96GB) · FP16 · 7.8 t/s
- Apple M4 Max (64GB) · Q8 · 15.6 t/s
- Apple M4 Max (48GB) · Q6_K · 20.8 t/s
- Apple M4 Pro (48GB) · Q6_K · 10.4 t/s
- Apple M4 (32GB) · Q3_K_M · 8.6 t/s
- Apple M3 Max (128GB) · FP16 · 5.7 t/s
- Apple M3 Max (96GB) · FP16 · 5.7 t/s
- Apple M3 Max (64GB) · Q8 · 11.4 t/s
- Apple M3 Max (48GB) · Q6_K · 15.2 t/s
- Apple M3 Max (36GB) · Q4_K_M · 22.9 t/s
- Apple M3 Pro (36GB) · Q4_K_M · 8.6 t/s
- Apple M2 Ultra (384GB) · FP16 · 11.4 t/s
- Apple M2 Ultra (192GB) · FP16 · 11.4 t/s
- Apple M2 Max (96GB) · FP16 · 5.7 t/s
- Apple M2 Max (64GB) · Q8 · 11.4 t/s
- Apple M2 Max (32GB) · Q3_K_M · 28.6 t/s
- Apple M2 Pro (32GB) · Q3_K_M · 14.3 t/s
- Apple M1 Ultra (128GB) · FP16 · 11.4 t/s
- Apple M1 Ultra (64GB) · Q8 · 22.9 t/s
- Apple M1 Max (64GB) · Q8 · 11.4 t/s
- Apple M1 Max (32GB) · Q3_K_M · 28.6 t/s
- Apple M1 Pro (32GB) · Q3_K_M · 14.3 t/s
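The quantization shown next to each device appears to be the highest-precision one whose total footprint fits in that device's memory. A minimal sketch of that selection follows, using the Total column from the VRAM table. The function name `best_quant` and the `headroom_gb` parameter are hypothetical; the 1 GB default is a guess that happens to match the listed entries, not a documented rule.

```python
# Total VRAM at 8k context in GB, copied from the table on this page,
# ordered from highest to lowest precision.
TOTAL_GB_8K = [
    ("FP16", 90.4), ("Q8", 51.2), ("Q6_K", 41.4), ("Q5_K_M", 36.5),
    ("Q4_K_M", 31.6), ("Q3_K_M", 27.7), ("Q2_K", 23.8),
]


def best_quant(vram_gb: float, headroom_gb: float = 1.0):
    """Return the highest-precision quant that fits fully in memory, or None.

    headroom_gb is an assumed safety margin (hypothetical parameter).
    """
    for quant, total in TOTAL_GB_8K:
        if total + headroom_gb <= vram_gb:
            return quant
    return None  # nothing fits; CPU offload would be needed


if __name__ == "__main__":
    for name, vram in [("A100 40GB", 40), ("H100 80GB", 80), ("Apple M3 Max (36GB)", 36)]:
        print(name, "->", best_quant(vram))
```

Under these assumptions, 40 GB selects Q5_K_M, 80 GB selects Q8, and 36 GB selects Q4_K_M, in line with the entries above. A device for which the function returns `None` would fall into the CPU-offload group.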
Plus 15 GPUs that run it with CPU offload (slower)
- NVIDIA RTX 4090 · Q6_K · 9.6 t/s
- NVIDIA RTX 4080 · Q5_K_M · 8.2 t/s
- NVIDIA RTX 4070 Ti · Q5_K_M · 5.8 t/s
- NVIDIA RTX 4070 · Q5_K_M · 5.8 t/s
- NVIDIA RTX 4060 Ti 16GB · Q5_K_M · 3.3 t/s
- NVIDIA RTX 4060 · Q4_K_M · 3.9 t/s
- NVIDIA RTX 3090 · Q6_K · 8.9 t/s
- NVIDIA RTX 3090 Ti · Q6_K · 9.6 t/s
- NVIDIA RTX 3080 10GB · Q4_K_M · 10.9 t/s
- NVIDIA RTX 3060 12GB · Q5_K_M · 4.1 t/s
- AMD Radeon RX 7900 XTX · Q6_K · 9.1 t/s
- AMD Radeon RX 7900 XT · Q6_K · 7.6 t/s
- AMD Radeon RX 6800 XT · Q5_K_M · 5.9 t/s
- Intel Arc A770 16GB · Q5_K_M · 6.4 t/s
- CPU only (system RAM) · Q3_K_M · 0.7 t/s
Notes
Uses full multi-head attention (no GQA), so every attention head stores its own keys and values and the KV cache becomes heavy at long context.
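To illustrate why the absence of GQA matters, the sketch below computes KV cache size as a function of the number of key/value heads. The architecture values (40 layers, 64 heads, head dimension 128, FP16 cache) are assumptions taken from the published Command-R v01 configuration, and the 8-KV-head variant is purely hypothetical, included only for comparison.

```python
def kv_cache_gb(context_tokens: int,
                n_layers: int = 40,        # assumed layer count
                n_kv_heads: int = 64,      # full attention: one KV head per query head
                head_dim: int = 128,       # assumed per-head dimension
                bytes_per_value: int = 2   # assumed FP16 cache
                ) -> float:
    """KV cache size in decimal GB for a given context length."""
    # Factor 2 accounts for storing both keys and values in every layer.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


if __name__ == "__main__":
    for ctx in (8192, 32768, 125_000):
        full = kv_cache_gb(ctx)                 # no GQA, as on this page
        gqa = kv_cache_gb(ctx, n_kv_heads=8)    # hypothetical GQA with 8 KV heads
        print(f"{ctx:>7} tokens: {full:7.2f} GB full attention, {gqa:6.2f} GB with 8 KV heads")
```

With these assumptions the 8k figure comes out at about 10.74 GB, and the cache scales in direct proportion to both context length and the number of KV heads, so a grouped-query layout with 8 KV heads would need one eighth of the memory.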
