GPT-OSS 20B
GPT-OSS 20B needs roughly 13.7 GB VRAM at Q4_K_M quantization (47.5 GB at FP16). 93 GPUs we track can run it fully in VRAM at 8k context.
93 GPUs run this natively · 11 with CPU offload
GPT-OSS 20B is a Mixture of Experts (MoE) model developed by OpenAI and released in August 2025. It has 21B total parameters but only 4B active per token, and it matches o3-mini on key benchmarks.
To run GPT-OSS 20B locally, use Q5_K_M (about 14-16 GB), which fits on 16 GB GPUs. It is the best reasoning model for 16 GB hardware. Because it is a MoE model, inference speed depends on the active parameters (4B) rather than the total size.
It scores 71.5% on GPQA at 21B scale, which is exceptional reasoning efficiency for its size.
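The point about active parameters can be illustrated with a rough, bandwidth-bound estimate. This is a minimal sketch under stated assumptions, not how the figures on this page were computed: it assumes token generation is limited by how fast the active weights can be read from memory, that 1 GB is 1e9 bytes, and it uses a hypothetical 1000 GB/s memory bandwidth. It ignores KV cache reads, layers that are always active, and compute limits, so treat the result as a ceiling rather than a prediction.

```python
TOTAL_PARAMS = 21e9   # total parameters, from the text above
ACTIVE_PARAMS = 4e9   # parameters active per token, from the text above


def bits_per_weight(weights_gb: float) -> float:
    """Average bits per parameter implied by a weight size (1 GB = 1e9 bytes, assumed)."""
    return weights_gb * 1e9 * 8 / TOTAL_PARAMS


def decode_ceiling_tps(weights_gb: float, bandwidth_gb_s: float,
                       params_read: float = ACTIVE_PARAMS) -> float:
    """Upper bound on tokens/s if every token must read `params_read` weights once."""
    bytes_per_token = params_read * bits_per_weight(weights_gb) / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    q4_weights_gb = 11.8      # Q4_K_M weight size listed on this page
    bandwidth = 1000.0        # hypothetical GPU memory bandwidth in GB/s
    moe = decode_ceiling_tps(q4_weights_gb, bandwidth)
    dense = decode_ceiling_tps(q4_weights_gb, bandwidth, params_read=TOTAL_PARAMS)
    print(f"MoE ceiling:   {moe:.0f} t/s")
    print(f"Dense ceiling: {dense:.0f} t/s")
    print(f"Ratio:         {moe / dense:.2f}x")
```

Under this simple model the MoE ceiling is higher than a dense 21B model's by the ratio of total to active parameters (21/4), which is why the speed tracks the 4B figure even though all 21B parameters must still fit in memory.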
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 84.0 GB | 0.40 GB | 94.5 GB |
| BF16 | 42.0 GB | 0.40 GB | 47.5 GB |
| FP16 | 42.0 GB | 0.40 GB | 47.5 GB |
| Q8_0 | 21.0 GB | 0.40 GB | 24.0 GB |
| Q6_K | 17.2 GB | 0.40 GB | 19.7 GB |
| Q5_K_M (recommended) | 13.5 GB | 0.40 GB | 15.6 GB |
| Q4_K_M | 11.8 GB | 0.40 GB | 13.7 GB |
| Q3_K_M | 9.0 GB | 0.40 GB | 10.6 GB |
| Q2_K | 6.9 GB | 0.40 GB | 8.2 GB |
| NVFP4 (CUDA only) | 10.5 GB | 0.40 GB | 12.2 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
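Since the KV cache grows linearly with context length, the 8k figures can be extrapolated to other context sizes. The sketch below is a rough estimate, not the page's calculator: it assumes "8k" means 8192 tokens, that only the KV cache term changes with context, and that the KV cache stays in FP16. The function name `estimate_total_gb` is made up for this example.

```python
BASE_CONTEXT = 8192          # assumption: "8k" means 8192 tokens
KV_CACHE_GB_AT_BASE = 0.40   # KV cache at 8k context (FP16), from the table

# Total VRAM in GB at 8k context, copied from the table.
TOTAL_GB_AT_BASE = {
    "Q8_0": 24.0,
    "Q6_K": 19.7,
    "Q5_K_M": 15.6,
    "Q4_K_M": 13.7,
    "Q3_K_M": 10.6,
    "Q2_K": 8.2,
}


def estimate_total_gb(quant: str, context_tokens: int) -> float:
    """Scale the KV cache linearly with context; keep every other term fixed."""
    kv_cache_gb = KV_CACHE_GB_AT_BASE * context_tokens / BASE_CONTEXT
    return TOTAL_GB_AT_BASE[quant] - KV_CACHE_GB_AT_BASE + kv_cache_gb


if __name__ == "__main__":
    for ctx in (8192, 32768, 131072):
        print(ctx, round(estimate_total_gb("Q4_K_M", ctx), 1))
```

For example, going from 8k to 32k context multiplies the KV cache by four (0.40 GB to 1.6 GB), so the Q4_K_M estimate rises from 13.7 GB to about 14.9 GB.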
Benchmarks
GPUs that run GPT-OSS 20B natively (93)
- NVIDIA RTX 5090 · NVFP4 · 985.6 t/s
- NVIDIA RTX 5080 · NVFP4 · 528 t/s
- NVIDIA RTX 5070 Ti · NVFP4 · 492.8 t/s
- NVIDIA RTX 5070 · Q3_K_M · 429.8 t/s
- NVIDIA RTX 5060 Ti 16GB · NVFP4 · 246.4 t/s
- NVIDIA RTX 4090 · NVFP4 · 554.4 t/s
- NVIDIA RTX 4080 · NVFP4 · 394.4 t/s
- NVIDIA RTX 4070 Ti · Q3_K_M · 322.3 t/s
- NVIDIA RTX 4070 · Q3_K_M · 322.3 t/s
- NVIDIA RTX 4060 Ti 16GB · NVFP4 · 158.4 t/s
- NVIDIA RTX 3090 · NVFP4 · 514.8 t/s
- NVIDIA RTX 3090 Ti · NVFP4 · 554.4 t/s
- NVIDIA RTX 3080 10GB · Q2_K · 635.3 t/s
- NVIDIA RTX 3060 12GB · Q3_K_M · 230.2 t/s
- NVIDIA H100 80GB · BF16 · 460.6 t/s
- NVIDIA A100 80GB · BF16 · 280.4 t/s
- NVIDIA A100 40GB · NVFP4 · 855.3 t/s
- NVIDIA L40S · NVFP4 · 475.2 t/s
- NVIDIA RTX A6000 · NVFP4 · 422.4 t/s
- NVIDIA RTX 4000 Ada · NVFP4 · 176 t/s
- NVIDIA RTX 4500 Ada · NVFP4 · 237.6 t/s
- NVIDIA RTX 5000 Ada · NVFP4 · 316.8 t/s
- NVIDIA RTX 6000 Ada · NVFP4 · 528 t/s
- NVIDIA RTX Pro 6000 · BF16 · 184.8 t/s
- NVIDIA DGX Spark (128GB) · FP32 · 18.8 t/s
- AMD Radeon RX 7900 XTX · Q6_K · 322 t/s
- AMD Radeon RX 7900 XT · Q5_K_M · 341.6 t/s
- AMD Radeon RX 7900 GRE · Q4_K_M · 281.3 t/s
- AMD Radeon RX 6800 XT · Q4_K_M · 250.1 t/s
- AMD Radeon PRO W7800 · Q8_0 · 158.4 t/s
- AMD Radeon PRO W7900 · Q8_0 · 237.6 t/s
- AMD Instinct MI300X · FP32 · 364.4 t/s
- AMD Radeon AI Pro 9700 32GB · Q8_0 · 176 t/s
- AMD Strix Halo (128GB) · FP32 · 17.6 t/s
- AMD Strix Halo (96GB) · BF16 · 35.2 t/s
- AMD Strix Halo (64GB) · BF16 · 35.2 t/s
- Apple M5 Max (128GB) · FP32 · 42.2 t/s
- Apple M5 Max (64GB) · BF16 · 84.4 t/s
- Apple M5 Max (48GB) · Q8_0 · 168.9 t/s
- Apple M5 Pro (48GB) · Q8_0 · 84.4 t/s
- Apple M5 Pro (36GB) · Q8_0 · 84.4 t/s
- Apple M5 Pro (24GB) · Q6_K · 103 t/s
- Apple M5 (32GB) · Q8_0 · 42.1 t/s
- Apple M5 (16GB) · Q3_K_M · 97.8 t/s
- Apple M4 Ultra (384GB) · FP32 · 75.1 t/s
- Apple M4 Ultra (192GB) · FP32 · 75.1 t/s
- Apple M4 Max (128GB) · FP32 · 37.5 t/s
- Apple M4 Max (96GB) · BF16 · 75.1 t/s
- Apple M4 Max (64GB) · BF16 · 75.1 t/s
- Apple M4 Max (48GB) · Q8_0 · 150.2 t/s
- Apple M4 Pro (48GB) · Q8_0 · 75.1 t/s
- Apple M4 Pro (24GB) · Q6_K · 91.6 t/s
- Apple M4 (32GB) · Q8_0 · 33 t/s
- Apple M4 (16GB) · Q3_K_M · 76.7 t/s
- Apple M3 Ultra (512GB) · FP32 · 56.3 t/s
- Apple M3 Ultra (256GB) · FP32 · 56.3 t/s
- Apple M3 Ultra (96GB) · BF16 · 112.6 t/s
- Apple M3 Max (128GB) · FP32 · 27.5 t/s
- Apple M3 Max (96GB) · BF16 · 55 t/s
- Apple M3 Max (64GB) · BF16 · 55 t/s
- Apple M3 Max (48GB) · Q8_0 · 110 t/s
- Apple M3 Max (36GB) · Q8_0 · 110 t/s
- Apple M3 Pro (36GB) · Q8_0 · 41.3 t/s
- Apple M3 Pro (18GB) · Q4_K_M · 73.3 t/s
- Apple M3 (24GB) · Q6_K · 33.5 t/s
- Apple M3 (16GB) · Q3_K_M · 64 t/s
- Apple M2 Ultra (384GB) · FP32 · 55 t/s
- Apple M2 Ultra (192GB) · FP32 · 55 t/s
- Apple M2 Max (96GB) · BF16 · 55 t/s
- Apple M2 Max (64GB) · BF16 · 55 t/s
- Apple M2 Max (32GB) · Q8_0 · 110 t/s
- Apple M2 Pro (32GB) · Q8_0 · 55 t/s
- Apple M2 Pro (16GB) · Q3_K_M · 127.9 t/s
- Apple M2 (24GB) · Q6_K · 33.5 t/s
- Apple M2 (16GB) · Q3_K_M · 64 t/s
- Apple M1 Ultra (128GB) · FP32 · 55 t/s
- Apple M1 Ultra (64GB) · BF16 · 110 t/s
- Apple M1 Max (64GB) · BF16 · 55 t/s
- Apple M1 Max (32GB) · Q8_0 · 110 t/s
- Apple M1 Pro (32GB) · Q8_0 · 55 t/s
- Apple M1 Pro (16GB) · Q3_K_M · 127.9 t/s
- Apple M1 (16GB) · Q3_K_M · 43.5 t/s
- Intel Arc B580 12GB · Q3_K_M · 291.6 t/s
- Intel Arc B570 10GB · Q2_K · 317.6 t/s
- Intel Arc Pro B70 24GB · Q6_K · 152.9 t/s
- Intel Arc Pro B60 24GB · Q6_K · 127.4 t/s
- Intel Arc A770 16GB · Q4_K_M · 273.5 t/s
- Intel Arc Pro A60 12GB · Q3_K_M · 245.6 t/s
- Intel Data Center GPU Max 1550 · FP32 · 225.2 t/s
- Intel Data Center GPU Max 1100 · Q8_0 · 338 t/s
- Intel Arc 140V (32GB) · Q8_0 · 37.7 t/s
- Intel Arc 140V (16GB) · Q3_K_M · 87.6 t/s
- Intel Arc 130V (16GB) · Q3_K_M · 87.6 t/s
Plus 11 GPUs that run it with CPU offload (slower)
- NVIDIA RTX 5060 · NVFP4 · 56 t/s
- NVIDIA RTX 5050 · NVFP4 · 40 t/s
- NVIDIA RTX 4060 · NVFP4 · 34 t/s
- Intel Arc A770 8GB · Q8_0 · 32 t/s
- Intel Arc A750 8GB · Q8_0 · 32 t/s
- Intel Arc A580 8GB · Q8_0 · 32 t/s
- Intel Arc A380 6GB · Q8_0 · 11.6 t/s
- Intel Arc A310 4GB · Q8_0 · 7.8 t/s
- Intel Arc Pro A50 6GB · Q8_0 · 12 t/s
- Intel Arc Pro A40 6GB · Q8_0 · 12 t/s
- CPU only (system RAM) · Q8_0 · 2.6 t/s
Notes
Smaller sibling of GPT-OSS 120B. Matches o3-mini on key benchmarks; runs on 16 GB of VRAM.
Frequently asked questions
- What are the VRAM requirements for GPT-OSS 20B?
- GPT-OSS 20B requires approximately 13.7 GB of VRAM at Q4_K_M quantization, 24.0 GB at Q8_0, and 47.5 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length, so longer contexts need more VRAM.
- How many parameters does GPT-OSS 20B have?
- GPT-OSS 20B has 21 billion total parameters, but only 4 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests.
- How capable is GPT-OSS 20B?
- With an MMLU-Pro score of 67.86, GPT-OSS 20B delivers solid general-purpose performance suitable for most everyday tasks and professional use.
- Can GPT-OSS 20B run on a 16 GB GPU?
- Yes. GPT-OSS 20B needs 13.7 GB at Q4_K_M, which fits in a 16 GB GPU like the RTX 4080 or RTX 4070 Ti Super.
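- What is the highest-quality quantization for GPT-OSS 20B that fits in 24 GB of VRAM?
- At 8k context, Q6_K needs 19.7 GB and fits in 24 GB of VRAM with room to spare. Q8_0 needs 24.0 GB, which leaves no headroom on a 24 GB card. On CUDA GPUs, NVFP4 is a smaller option at 12.2 GB.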
- What GPU do I need to run GPT-OSS 20B locally?
- A 16 GB GPU is enough. At Q4_K_M, GPT-OSS 20B needs 13.7 GB VRAM. Good options: RTX 4080 (16 GB), RTX 4070 Ti Super (16 GB).
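The "does it fit" answers above come down to a lookup against the VRAM table. The sketch below shows one way to do that lookup. It is illustrative only: the function `best_quant` and its `headroom_gb` parameter are made up for this example, and it treats a larger total size as a stand-in for higher quality, which is a simplifying assumption.

```python
from typing import Optional

# Total VRAM in GB at 8k context, copied from the table on this page.
TOTAL_GB = {
    "FP32": 94.5,
    "BF16": 47.5,
    "FP16": 47.5,
    "Q8_0": 24.0,
    "Q6_K": 19.7,
    "Q5_K_M": 15.6,
    "Q4_K_M": 13.7,
    "Q3_K_M": 10.6,
    "Q2_K": 8.2,
    "NVFP4": 12.2,
}
CUDA_ONLY = {"NVFP4"}  # the page notes that NVFP4 requires a CUDA GPU


def best_quant(vram_gb: float, headroom_gb: float = 0.0,
               cuda: bool = False) -> Optional[str]:
    """Return the largest quantization whose 8k-context total fits the budget."""
    budget = vram_gb - headroom_gb
    candidates = [
        (total, name)
        for name, total in TOTAL_GB.items()
        if total <= budget and (cuda or name not in CUDA_ONLY)
    ]
    if not candidates:
        return None  # nothing fits fully in VRAM; CPU offload would be needed
    return max(candidates, key=lambda c: c[0])[1]


if __name__ == "__main__":
    print(best_quant(16))                   # 16 GB card, no headroom reserved
    print(best_quant(24, headroom_gb=1.0))  # 24 GB card, 1 GB kept free
```

Reserving some headroom is a judgment call: a quantization whose total exactly equals the card's capacity leaves nothing for the display, the runtime, or longer contexts.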