GPT-OSS 120B
GPT-OSS 120B needs roughly 74.5 GB VRAM at Q4_K_M quantization (262.8 GB at FP16). 39 GPUs we track can run it fully in VRAM at 8k context.
39 GPUs run this natively · 14 with CPU offload
GPT-OSS 120B is a Mixture of Experts (MoE) model developed by OpenAI and released in August 2025. It has 117B total parameters, of which only 5B are active per token. It uses alternating sliding-window and full attention layers and is licensed under Apache 2.0.
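To run GPT-OSS 120B locally, plan for roughly 70-80 GB at Q4_K_M, which fits on a single 80 GB GPU or a Mac Studio with enough unified memory. It is among the strongest open reasoning models at this size. Because it is a MoE model, inference speed depends on the active parameters (5B) rather than the total size.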
It scores 80.1% on GPQA, near parity with o4-mini, and fits on a single 80 GB GPU at Q4.
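The point about active parameters can be made concrete with a rough, bandwidth-bound estimate. The sketch below assumes that token generation is limited by how fast weights stream from memory, and it ignores KV cache reads and compute. The 1000 GB/s bandwidth and the 4.5 bits per weight are illustrative assumptions, not figures from this page, and the result is an upper bound rather than a prediction of the t/s values listed further down.

```python
def decode_tokens_per_second(bandwidth_gb_s, params_billion, bits_per_weight):
    """Upper bound on decode speed when each generated token requires
    streaming `params_billion` parameters from memory once."""
    gb_read_per_token = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 1000.0  # ASSUMPTION: hypothetical device, not a measured figure
BITS_PER_WEIGHT = 4.5    # ASSUMPTION: rough average for a 4-bit quantization

# MoE: only the 5B active parameters are read per token.
moe = decode_tokens_per_second(BANDWIDTH_GB_S, 5, BITS_PER_WEIGHT)
# Hypothetical dense model that reads all 117B parameters per token.
dense = decode_tokens_per_second(BANDWIDTH_GB_S, 117, BITS_PER_WEIGHT)

print(f"MoE bound:   {moe:.0f} tokens/s")
print(f"Dense bound: {dense:.0f} tokens/s")
print(f"Ratio:       {moe / dense:.1f}x")
```

Under these assumptions the ratio is simply 117 / 5, regardless of the bandwidth or quantization chosen. Memory capacity is still set by the full 117B parameters, since every expert must be resident.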
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 468.0 GB | 0.60 GB | 524.8 GB |
| BF16 | 234.0 GB | 0.60 GB | 262.8 GB |
| FP16 | 234.0 GB | 0.60 GB | 262.8 GB |
| Q8_0 | 117.0 GB | 0.60 GB | 131.7 GB |
| Q6_K | 95.9 GB | 0.60 GB | 108.1 GB |
| Q5_K_M | 75.3 GB | 0.60 GB | 85.1 GB |
| Q4_K_M (recommended) | 65.9 GB | 0.60 GB | 74.5 GB |
| Q3_K_M | 50.3 GB | 0.60 GB | 57.0 GB |
| Q2_K | 38.5 GB | 0.60 GB | 43.8 GB |
| NVFP4 (CUDA only) | 58.5 GB | 0.60 GB | 66.2 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
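The Total column is larger than weights plus KV cache. The rows are consistent with a flat overhead of about 12% applied to that sum, although the page does not state this. The sketch below reproduces the table under that inferred assumption; the `OVERHEAD` factor and the helper names are illustrative, not the calculator's actual code.

```python
PARAMS_BILLION = 117  # total parameters, from the text above
KV_CACHE_GB = 0.60    # FP16 KV cache at 8k context, from the table
OVERHEAD = 1.12       # ASSUMPTION: inferred from the table, not stated by the source


def weights_gb(bits_per_weight, params_billion=PARAMS_BILLION):
    """Weight size in GB for a given average number of bits per weight."""
    return params_billion * bits_per_weight / 8


def total_gb(weights, kv_cache=KV_CACHE_GB, overhead=OVERHEAD):
    """Estimated total VRAM: (weights + KV cache) scaled by the overhead factor."""
    return (weights + kv_cache) * overhead


# (weights GB, listed total GB) copied from the table above
TABLE = {
    "FP32": (468.0, 524.8),
    "FP16": (234.0, 262.8),
    "Q8_0": (117.0, 131.7),
    "Q6_K": (95.9, 108.1),
    "Q5_K_M": (75.3, 85.1),
    "Q4_K_M": (65.9, 74.5),
    "Q3_K_M": (50.3, 57.0),
    "Q2_K": (38.5, 43.8),
    "NVFP4": (58.5, 66.2),
}

for quant, (weights, listed) in TABLE.items():
    estimate = total_gb(weights)
    print(f"{quant:8s} estimated {estimate:6.1f} GB, listed {listed:6.1f} GB")
```

With these inputs every estimate lands within 0.1 GB of the listed total. The FP32, FP16, Q8_0, and NVFP4 weight sizes also follow directly from `weights_gb` at 32, 16, 8, and 4 bits per weight.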
Benchmarks
GPUs that run GPT-OSS 120B natively (39)
- NVIDIA H100 80GB · NVFP4 · 1474 t/s
- NVIDIA A100 80GB · NVFP4 · 897.2 t/s
- NVIDIA L40S · Q2_K · 577.8 t/s
- NVIDIA RTX A6000 · Q2_K · 513.6 t/s
- NVIDIA RTX 6000 Ada · Q2_K · 641.9 t/s
- NVIDIA RTX Pro 6000 · NVFP4 · 591.4 t/s
- NVIDIA DGX Spark (128GB) · NVFP4 · 120.1 t/s
- AMD Radeon PRO W7900 · Q2_K · 577.8 t/s
- AMD Instinct MI300X · Q8_0 · 1166 t/s
- AMD Strix Halo (128GB) · Q6_K · 68.7 t/s
- AMD Strix Halo (96GB) · Q5_K_M · 87.5 t/s
- AMD Strix Halo (64GB) · Q3_K_M · 131 t/s
- Apple M5 Max (128GB) · Q6_K · 164.7 t/s
- Apple M5 Max (64GB) · Q3_K_M · 314.1 t/s
- Apple M5 Max (48GB) · Q2_K · 410.6 t/s
- Apple M5 Pro (48GB) · Q2_K · 205.3 t/s
- Apple M4 Ultra (384GB) · BF16 · 120.1 t/s
- Apple M4 Ultra (192GB) · Q8_0 · 240.2 t/s
- Apple M4 Max (128GB) · Q6_K · 146.5 t/s
- Apple M4 Max (96GB) · Q5_K_M · 186.5 t/s
- Apple M4 Max (64GB) · Q3_K_M · 279.3 t/s
- Apple M4 Max (48GB) · Q2_K · 365.1 t/s
- Apple M4 Pro (48GB) · Q2_K · 182.6 t/s
- Apple M3 Ultra (512GB) · BF16 · 90.1 t/s
- Apple M3 Ultra (256GB) · Q8_0 · 180.2 t/s
- Apple M3 Ultra (96GB) · Q5_K_M · 279.8 t/s
- Apple M3 Max (128GB) · Q6_K · 107.3 t/s
- Apple M3 Max (96GB) · Q5_K_M · 136.6 t/s
- Apple M3 Max (64GB) · Q3_K_M · 204.7 t/s
- Apple M3 Max (48GB) · Q2_K · 267.5 t/s
- Apple M2 Ultra (384GB) · BF16 · 88 t/s
- Apple M2 Ultra (192GB) · Q8_0 · 176 t/s
- Apple M2 Max (96GB) · Q5_K_M · 136.6 t/s
- Apple M2 Max (64GB) · Q3_K_M · 204.7 t/s
- Apple M1 Ultra (128GB) · Q6_K · 214.6 t/s
- Apple M1 Ultra (64GB) · Q3_K_M · 409.3 t/s
- Apple M1 Max (64GB) · Q3_K_M · 204.7 t/s
- Intel Data Center GPU Max 1550 · Q6_K · 878.9 t/s
- Intel Data Center GPU Max 1100 · Q2_K · 821.8 t/s
Plus 14 GPUs that run it with CPU offload (slower)
- NVIDIA RTX 5090 · Q2_K · 272.3 t/s
- NVIDIA RTX 4090 · Q2_K · 153.2 t/s
- NVIDIA RTX 3090 · Q2_K · 142.2 t/s
- NVIDIA RTX 3090 Ti · Q2_K · 153.2 t/s
- NVIDIA A100 40GB · Q3_K_M · 180.8 t/s
- NVIDIA RTX 4000 Ada · Q2_K · 48.6 t/s
- NVIDIA RTX 4500 Ada · Q2_K · 65.7 t/s
- NVIDIA RTX 5000 Ada · Q2_K · 87.5 t/s
- AMD Radeon RX 7900 XTX · Q2_K · 145.9 t/s
- AMD Radeon RX 7900 XT · Q2_K · 121.6 t/s
- AMD Radeon PRO W7800 · Q2_K · 87.5 t/s
- AMD Radeon AI Pro 9700 32GB · Q2_K · 97.3 t/s
- Intel Arc Pro B70 24GB · Q2_K · 69.3 t/s
- Intel Arc Pro B60 24GB · Q2_K · 57.8 t/s
Notes
MoE with alternating sliding-window and full attention. Near parity with o4-mini on GPQA; fits on a single 80 GB GPU at Q4.
Frequently asked questions
- What are the VRAM requirements for GPT-OSS 120B?
- GPT-OSS 120B requires approximately 74.5 GB of VRAM at Q4_K_M quantization, 131.7 GB at Q8_0, and 262.8 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length.
- How many parameters does GPT-OSS 120B have?
- GPT-OSS 120B has 117 billion total parameters, but only 5 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests.
- How capable is GPT-OSS 120B?
- GPT-OSS 120B achieves an MMLU-Pro score of 80.7, placing it among the most capable open-weight models available and making it competitive with frontier systems on general knowledge and reasoning.
- Can GPT-OSS 120B run on a 16 GB GPU?
- No. At Q4_K_M, GPT-OSS 120B needs 74.5 GB of VRAM, and even Q2_K needs 43.8 GB, far more than 16 GB. You will need a GPU or unified-memory system with much more memory.
- Can GPT-OSS 120B run on a 24 GB GPU?
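- Not fully in VRAM. GPT-OSS 120B needs 74.5 GB at Q4_K_M and 43.8 GB even at Q2_K. 24 GB cards such as the RTX 4090 and RTX 3090 can run it only with CPU offload, which is slower. For full in-VRAM inference at Q4_K_M, use hardware with 80 GB or more.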
- What is the smallest quantization for GPT-OSS 120B that fits in 24 GB of VRAM?
- GPT-OSS 120B cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 43.8 GB at Q2_K.
- What GPU do I need to run GPT-OSS 120B locally?
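- You need roughly 80 GB of GPU or unified memory. At Q4_K_M, GPT-OSS 120B needs 74.5 GB of VRAM, more than any single consumer GPU offers, but it fits on a single 80 GB data-center GPU such as the H100 or A100 80GB, or on an Apple Silicon or AMD Strix Halo system with 96 GB or more of unified memory. Smaller GPUs can run lower quantizations with CPU offload at reduced speed.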
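
The fit questions above all reduce to comparing available memory against the Total column of the VRAM table. A minimal sketch of that lookup follows. The `best_fit` helper is hypothetical, it assumes all of the device's memory is available to the model, and it leaves out NVFP4 because that format requires a CUDA GPU.

```python
# Totals at 8k context, copied from the VRAM table, highest precision first.
TOTAL_GB = [
    ("FP32", 524.8),
    ("BF16", 262.8),
    ("FP16", 262.8),
    ("Q8_0", 131.7),
    ("Q6_K", 108.1),
    ("Q5_K_M", 85.1),
    ("Q4_K_M", 74.5),
    ("Q3_K_M", 57.0),
    ("Q2_K", 43.8),
]


def best_fit(memory_gb):
    """Return the highest-precision quantization whose total fits in
    `memory_gb`, or None if even the smallest one does not fit."""
    for name, total in TOTAL_GB:
        if total <= memory_gb:
            return name
    return None


for memory in (16, 24, 48, 80, 128):
    print(f"{memory:>3} GB -> {best_fit(memory)}")
```

With these totals, 16 GB and 24 GB return `None`, 48 GB returns `Q2_K`, 80 GB returns `Q4_K_M`, and 128 GB returns `Q6_K`, which matches the non-CUDA entries in the GPU list. The list shows NVFP4 for the 80 GB NVIDIA cards, so the page evidently applies a different preference on CUDA hardware.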