Qwen 2.5 Coder 32B Instruct
Qwen 2.5 Coder 32B Instruct needs roughly 22.9 GB of VRAM at Q4_K_M quantization (75.2 GB at FP16). Of the GPUs we track, 75 can run it fully in VRAM at 8k context.
75 GPUs run this natively · 19 with CPU offload
Qwen 2.5 Coder 32B Instruct is a 32.5B-parameter dense model developed by Alibaba. Released in November 2024, it is the coding-specialized variant of Qwen 2.5 and the best open-weight coding model at this size.
To run Qwen 2.5 Coder 32B Instruct locally, plan for the same VRAM as Qwen2.5-32B: roughly 18-20 GB for the Q4 weights alone, or 22.9 GB in total at Q4_K_M with 8k context. It is the top choice for developers with 24 GB GPUs.
HumanEval 92.7% is exceptional, rivaling much larger models. MMLU-Pro 50.4% shows strong general capabilities too.
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 130.0 GB | 2.15 GB | 148.0 GB |
| BF16 | 65.0 GB | 2.15 GB | 75.2 GB |
| FP16 | 65.0 GB | 2.15 GB | 75.2 GB |
| Q8_0 | 32.5 GB | 2.15 GB | 38.8 GB |
| Q6_K | 26.6 GB | 2.15 GB | 32.3 GB |
| Q5_K_M | 20.9 GB | 2.15 GB | 25.9 GB |
| Q4_K_M (recommended) | 18.3 GB | 2.15 GB | 22.9 GB |
| Q3_K_M | 14.0 GB | 2.15 GB | 18.1 GB |
| Q2_K | 10.7 GB | 2.15 GB | 14.4 GB |
| NVFP4 (CUDA only) | 16.3 GB | 2.15 GB | 20.6 GB |
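KV cache shown at 8k context (FP16). The Total column is higher than weights plus KV cache, which suggests it also includes runtime overhead. NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.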
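The arithmetic behind the table can be sketched roughly as follows. Two things here are assumptions rather than facts stated on this page: the attention layout (64 layers, 8 key/value heads, head dimension 128, taken from the base Qwen2.5-32B architecture) and the 12% overhead factor, which is inferred from the gap between the Total column and the sum of weights plus KV cache. The weight sizes are copied from the table.

```python
# Assumed attention layout (base Qwen2.5-32B); not stated on this page.
N_LAYERS = 64
N_KV_HEADS = 8
HEAD_DIM = 128

# Inferred from the table (Total vs. weights + KV cache); an assumption.
OVERHEAD = 1.12

# Weight sizes in GB, copied from the table above.
WEIGHTS_GB = {
    "FP16": 65.0,
    "Q8_0": 32.5,
    "Q6_K": 26.6,
    "Q5_K_M": 20.9,
    "Q4_K_M": 18.3,
    "Q3_K_M": 14.0,
    "Q2_K": 10.7,
    "NVFP4": 16.3,
}


def kv_cache_gb(context_tokens, bytes_per_value=2):
    """KV cache size in GB; grows linearly with context length."""
    # One key and one value vector per KV head, per layer, per token.
    per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


def total_vram_gb(quant, context_tokens=8192):
    """Estimated total VRAM: (weights + KV cache) scaled by the overhead."""
    return (WEIGHTS_GB[quant] + kv_cache_gb(context_tokens)) * OVERHEAD


if __name__ == "__main__":
    for ctx in (8192, 32768):
        print(ctx, round(kv_cache_gb(ctx), 2), round(total_vram_gb("Q4_K_M", ctx), 1))
```

At 8k context (8192 tokens) this reproduces the 2.15 GB KV cache and the 22.9 GB Q4_K_M total from the table. Other context lengths are extrapolations under the same assumptions, not measured values.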
Benchmarks
GPUs that run Qwen 2.5 Coder 32B Instruct natively (75)
- NVIDIA RTX 5090 · NVFP4 · 110.3 t/s
- NVIDIA RTX 5080 · Q2_K · 89.8 t/s
- NVIDIA RTX 5070 Ti · Q2_K · 83.8 t/s
- NVIDIA RTX 5060 Ti 16GB · Q2_K · 41.9 t/s
- NVIDIA RTX 4090 · NVFP4 · 62 t/s
- NVIDIA RTX 4080 · Q2_K · 67.1 t/s
- NVIDIA RTX 4060 Ti 16GB · Q2_K · 26.9 t/s
- NVIDIA RTX 3090 · NVFP4 · 57.6 t/s
- NVIDIA RTX 3090 Ti · NVFP4 · 62 t/s
- NVIDIA H100 80GB · BF16 · 51.5 t/s
- NVIDIA A100 80GB · BF16 · 31.4 t/s
- NVIDIA A100 40GB · NVFP4 · 95.7 t/s
- NVIDIA L40S · NVFP4 · 53.2 t/s
- NVIDIA RTX A6000 · NVFP4 · 47.3 t/s
- NVIDIA RTX 4000 Ada · Q3_K_M · 22.9 t/s
- NVIDIA RTX 4500 Ada · NVFP4 · 26.6 t/s
- NVIDIA RTX 5000 Ada · NVFP4 · 35.4 t/s
- NVIDIA RTX 6000 Ada · NVFP4 · 59.1 t/s
- NVIDIA RTX Pro 6000 · BF16 · 20.7 t/s
- NVIDIA DGX Spark (128GB) · BF16 · 4.2 t/s
- AMD Radeon RX 7900 XTX · Q3_K_M · 68.7 t/s
- AMD Radeon RX 7900 XT · Q3_K_M · 57.2 t/s
- AMD Radeon RX 7900 GRE · Q2_K · 53.9 t/s
- AMD Radeon RX 6800 XT · Q2_K · 47.9 t/s
- AMD Radeon PRO W7800 · Q5_K_M · 27.5 t/s
- AMD Radeon PRO W7900 · Q8_0 · 26.6 t/s
- AMD Instinct MI300X · FP32 · 40.8 t/s
- AMD Radeon AI Pro 9700 32GB · Q5_K_M · 30.6 t/s
- AMD Strix Halo (128GB) · BF16 · 3.9 t/s
- AMD Strix Halo (96GB) · BF16 · 3.9 t/s
- AMD Strix Halo (64GB) · Q8_0 · 7.9 t/s
- Apple M5 Max (128GB) · BF16 · 9.4 t/s
- Apple M5 Max (64GB) · Q8_0 · 18.9 t/s
- Apple M5 Max (48GB) · Q8_0 · 18.9 t/s
- Apple M5 Pro (48GB) · Q8_0 · 9.4 t/s
- Apple M5 Pro (36GB) · Q5_K_M · 14.7 t/s
- Apple M5 Pro (24GB) · Q3_K_M · 22 t/s
- Apple M5 (32GB) · Q5_K_M · 7.3 t/s
- Apple M4 Ultra (384GB) · FP32 · 8.4 t/s
- Apple M4 Ultra (192GB) · FP32 · 8.4 t/s
- Apple M4 Max (128GB) · BF16 · 8.4 t/s
- Apple M4 Max (96GB) · BF16 · 8.4 t/s
- Apple M4 Max (64GB) · Q8_0 · 16.8 t/s
- Apple M4 Max (48GB) · Q8_0 · 16.8 t/s
- Apple M4 Pro (48GB) · Q8_0 · 8.4 t/s
- Apple M4 Pro (24GB) · Q3_K_M · 19.5 t/s
- Apple M4 (32GB) · Q5_K_M · 5.7 t/s
- Apple M3 Ultra (512GB) · FP32 · 6.3 t/s
- Apple M3 Ultra (256GB) · FP32 · 6.3 t/s
- Apple M3 Ultra (96GB) · BF16 · 12.6 t/s
- Apple M3 Max (128GB) · BF16 · 6.2 t/s
- Apple M3 Max (96GB) · BF16 · 6.2 t/s
- Apple M3 Max (64GB) · Q8_0 · 12.3 t/s
- Apple M3 Max (48GB) · Q8_0 · 12.3 t/s
- Apple M3 Max (36GB) · Q5_K_M · 19.1 t/s
- Apple M3 Pro (36GB) · Q5_K_M · 7.2 t/s
- Apple M3 (24GB) · Q3_K_M · 7.2 t/s
- Apple M2 Ultra (384GB) · FP32 · 6.2 t/s
- Apple M2 Ultra (192GB) · FP32 · 6.2 t/s
- Apple M2 Max (96GB) · BF16 · 6.2 t/s
- Apple M2 Max (64GB) · Q8_0 · 12.3 t/s
- Apple M2 Max (32GB) · Q5_K_M · 19.1 t/s
- Apple M2 Pro (32GB) · Q5_K_M · 9.6 t/s
- Apple M2 (24GB) · Q3_K_M · 7.2 t/s
- Apple M1 Ultra (128GB) · BF16 · 12.3 t/s
- Apple M1 Ultra (64GB) · Q8_0 · 24.6 t/s
- Apple M1 Max (64GB) · Q8_0 · 12.3 t/s
- Apple M1 Max (32GB) · Q5_K_M · 19.1 t/s
- Apple M1 Pro (32GB) · Q5_K_M · 9.6 t/s
- Intel Arc Pro B70 24GB · Q3_K_M · 32.6 t/s
- Intel Arc Pro B60 24GB · Q3_K_M · 27.2 t/s
- Intel Arc A770 16GB · Q2_K · 52.4 t/s
- Intel Data Center GPU Max 1550 · BF16 · 50.4 t/s
- Intel Data Center GPU Max 1100 · Q8_0 · 37.8 t/s
- Intel Arc 140V (32GB) · Q5_K_M · 6.5 t/s
Plus 19 GPUs that run it with CPU offload (slower)
- NVIDIA RTX 5070 · NVFP4 · 10.3 t/s
- NVIDIA RTX 5060 · NVFP4 · 6.9 t/s
- NVIDIA RTX 5050 · NVFP4 · 4.9 t/s
- NVIDIA RTX 4070 Ti · NVFP4 · 7.8 t/s
- NVIDIA RTX 4070 · NVFP4 · 7.8 t/s
- NVIDIA RTX 4060 · NVFP4 · 4.2 t/s
- NVIDIA RTX 3080 10GB · NVFP4 · 11.7 t/s
- NVIDIA RTX 3060 12GB · NVFP4 · 5.5 t/s
- Intel Arc B580 12GB · Q6_K · 4.3 t/s
- Intel Arc B570 10GB · Q6_K · 3.6 t/s
- Intel Arc A770 8GB · Q6_K · 4.8 t/s
- Intel Arc A750 8GB · Q6_K · 4.8 t/s
- Intel Arc A580 8GB · Q6_K · 4.8 t/s
- Intel Arc A380 6GB · Q5_K_M · 2.2 t/s
- Intel Arc A310 4GB · Q5_K_M · 1.5 t/s
- Intel Arc Pro A60 12GB · Q6_K · 3.6 t/s
- Intel Arc Pro A50 6GB · Q5_K_M · 2.3 t/s
- Intel Arc Pro A40 6GB · Q5_K_M · 2.3 t/s
- CPU only (system RAM) · Q5_K_M · 0.5 t/s
Frequently asked questions
- What are the VRAM requirements for Qwen 2.5 Coder 32B Instruct?
- Qwen 2.5 Coder 32B Instruct requires approximately 22.9 GB of VRAM at Q4_K_M quantization, 38.8 GB at Q8_0, and 75.2 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length, so longer contexts need more VRAM.
- How many parameters does Qwen 2.5 Coder 32B Instruct have?
- Qwen 2.5 Coder 32B Instruct has 32.5 billion parameters.
- How capable is Qwen 2.5 Coder 32B Instruct?
- Qwen 2.5 Coder 32B Instruct scores 92.7% on HumanEval, which is exceptional for its size and rivals much larger models. Its MMLU-Pro score of 50.4% indicates solid general-purpose performance alongside its coding strength.
- Can Qwen 2.5 Coder 32B Instruct run on a 16 GB GPU?
- Not at Q4_K_M, which needs 22.9 GB of VRAM. A 16 GB GPU can hold the model only at Q2_K (14.4 GB at 8k context), which is how the 16 GB cards in the list above run it. For Q4_K_M you will need a 24 GB GPU like the RTX 4090 or RTX 3090.
- Can Qwen 2.5 Coder 32B Instruct run on a 24 GB GPU?
- Yes. Qwen 2.5 Coder 32B Instruct fits in a 24 GB GPU at Q4_K_M, requiring 22.9 GB VRAM. GPUs with 24 GB include the RTX 4090, RTX 3090, and RTX 3090 Ti.
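- What is the highest-quality quantization for Qwen 2.5 Coder 32B Instruct that fits in 24 GB of VRAM?
- By raw size, Q4_K_M (22.9 GB) is the largest quantization in the table that stays under 24 GB, but it leaves only about 1 GB of headroom. NVFP4 needs 20.6 GB and is what the GPU list above selects for 24 GB NVIDIA cards such as the RTX 4090 and RTX 3090; it requires a CUDA GPU.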
- What GPU do I need to run Qwen 2.5 Coder 32B Instruct locally?
- A 24 GB GPU is the practical minimum for Q4_K_M, which needs 22.9 GB of VRAM. Good options: RTX 4090 (24 GB), RTX 3090 (24 GB). 16 GB cards can run the model only at Q2_K (14.4 GB).
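The "does it fit" questions above can be checked with a small lookup against the Total column of the VRAM table. This is a minimal sketch, not the calculator's actual logic: the function name and the `headroom_gb` parameter are hypothetical, and ranking quantizations by total size as a rough proxy for quality is an assumption.

```python
# Total VRAM in GB at 8k context, copied from the table on this page.
TOTAL_GB_8K = {
    "FP32": 148.0,
    "BF16": 75.2,
    "FP16": 75.2,
    "Q8_0": 38.8,
    "Q6_K": 32.3,
    "Q5_K_M": 25.9,
    "Q4_K_M": 22.9,
    "NVFP4": 20.6,
    "Q3_K_M": 18.1,
    "Q2_K": 14.4,
}


def largest_fitting_quant(vram_gb, cuda=True, headroom_gb=0.0):
    """Return the largest quantization whose 8k total fits, or None.

    headroom_gb is a hypothetical safety margin reserved for the
    display, driver, or other processes; it is not a value from this page.
    """
    budget = vram_gb - headroom_gb
    candidates = [
        (total, name)
        for name, total in TOTAL_GB_8K.items()
        if total <= budget and (cuda or name != "NVFP4")  # NVFP4 needs CUDA
    ]
    if not candidates:
        return None
    # Largest total first; ties are broken by name.
    return max(candidates)[1]


if __name__ == "__main__":
    print(largest_fitting_quant(24))
    print(largest_fitting_quant(24, headroom_gb=2))
    print(largest_fitting_quant(16))
```

With no headroom, a 24 GB card resolves to Q4_K_M and a 16 GB card to Q2_K, matching the answers above. Reserving 2 GB of headroom on a 24 GB CUDA card drops the result to NVFP4.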