Gemma 2 9B Instruct
Gemma 2 9B Instruct needs roughly 9.0 GB VRAM at Q4_K_M quantization (23.8 GB at FP16). 99 GPUs we track can run it fully in VRAM at 8k context.
99 GPUs run this natively · 5 with CPU offload
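Gemma 2 9B Instruct is a 9.2B-parameter dense model developed by Google and released in June 2024. It was trained with knowledge distillation from a 27B teacher model and is positioned as the strongest model in its size class.
To run Gemma 2 9B Instruct locally, Q5_K_M is the recommended quantization: its weights take roughly 6-7 GB, so it runs on 8 GB GPUs at short context lengths and offers an excellent quality-per-VRAM ratio. At the full 8k context, the total rises to 9.8 GB.
It scores 32.0% on MMLU-Pro, competitive with models 2-3× larger, and was trained 50× beyond compute-optimal.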
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 36.8 GB | 2.82 GB | 44.4 GB |
| BF16 | 18.4 GB | 2.82 GB | 23.8 GB |
| FP16 | 18.4 GB | 2.82 GB | 23.8 GB |
| Q8_0 | 9.2 GB | 2.82 GB | 13.5 GB |
| Q6_K | 7.5 GB | 2.82 GB | 11.6 GB |
| Q5_K_M (recommended) | 5.9 GB | 2.82 GB | 9.8 GB |
| Q4_K_M | 5.2 GB | 2.82 GB | 9.0 GB |
| Q3_K_M | 4.0 GB | 2.82 GB | 7.6 GB |
| Q2_K | 3.0 GB | 2.82 GB | 6.5 GB |
| NVFP4 (CUDA only) | 4.6 GB | 2.82 GB | 8.3 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
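The table's figures can be roughly reproduced and extrapolated to other context lengths with a few lines of arithmetic. The sketch below rests on assumptions that this page does not state: the attention shape (42 layers, 8 KV heads, head dimension 256) is taken from the published Gemma 2 9B configuration, "8k" is read as 8192 tokens, and the 1.12 overhead factor is inferred from the gap between each Total and the sum of its Weights and KV cache columns. The function names are hypothetical.

```python
# Assumed Gemma 2 9B attention shape (from the public model config, not this page).
N_LAYERS = 42
N_KV_HEADS = 8
HEAD_DIM = 256

# Inferred, not documented: every Total in the table is about 1.12x (weights + KV cache).
OVERHEAD = 1.12


def kv_cache_gb(context_tokens, bytes_per_value=2):
    """FP16 KV cache size in GB (decimal) for a given context length."""
    # One key and one value vector per layer, per KV head, per token.
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return bytes_per_token * context_tokens / 1e9


def total_vram_gb(weights_gb, context_tokens):
    """Estimated total VRAM: weights plus KV cache, scaled by the inferred overhead."""
    return OVERHEAD * (weights_gb + kv_cache_gb(context_tokens))


print(round(kv_cache_gb(8192), 2))         # 2.82, matches the KV cache column
print(round(total_vram_gb(5.2, 8192), 1))  # 9.0, matches the Q4_K_M row
print(round(total_vram_gb(5.2, 4096), 1))  # 7.4, Q4_K_M at half the context
```

Because the KV cache term is linear in `context_tokens`, halving the context removes about 1.4 GB of cache before overhead. Treat the results as estimates: actual usage depends on the runtime and its buffers.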
GPUs that run Gemma 2 9B Instruct natively (99)
- NVIDIA RTX 5090 · BF16 · 97.4 t/s
- NVIDIA RTX 5080 · NVFP4 · 208.7 t/s
- NVIDIA RTX 5070 Ti · NVFP4 · 194.8 t/s
- NVIDIA RTX 5070 · NVFP4 · 146.1 t/s
- NVIDIA RTX 5060 Ti 16GB · NVFP4 · 97.4 t/s
- NVIDIA RTX 5060 · Q3_K_M · 113.2 t/s
- NVIDIA RTX 5050 · Q3_K_M · 80.9 t/s
- NVIDIA RTX 4090 · NVFP4 · 219.1 t/s
- NVIDIA RTX 4080 · NVFP4 · 155.9 t/s
- NVIDIA RTX 4070 Ti · NVFP4 · 109.6 t/s
- NVIDIA RTX 4070 · NVFP4 · 109.6 t/s
- NVIDIA RTX 4060 Ti 16GB · NVFP4 · 62.6 t/s
- NVIDIA RTX 4060 · Q3_K_M · 68.8 t/s
- NVIDIA RTX 3090 · NVFP4 · 203.5 t/s
- NVIDIA RTX 3090 Ti · NVFP4 · 219.1 t/s
- NVIDIA RTX 3080 10GB · NVFP4 · 165.2 t/s
- NVIDIA RTX 3060 12GB · NVFP4 · 78.3 t/s
- NVIDIA H100 80GB · FP32 · 91 t/s
- NVIDIA A100 80GB · FP32 · 55.4 t/s
- NVIDIA A100 40GB · BF16 · 84.5 t/s
- NVIDIA L40S · FP32 · 23.5 t/s
- NVIDIA RTX A6000 · FP32 · 20.9 t/s
- NVIDIA RTX 4000 Ada · NVFP4 · 69.6 t/s
- NVIDIA RTX 4500 Ada · NVFP4 · 93.9 t/s
- NVIDIA RTX 5000 Ada · BF16 · 31.3 t/s
- NVIDIA RTX 6000 Ada · FP32 · 26.1 t/s
- NVIDIA RTX Pro 6000 · FP32 · 36.5 t/s
- NVIDIA DGX Spark (128GB) · FP32 · 7.4 t/s
- AMD Radeon RX 7900 XTX · Q8_0 · 104.3 t/s
- AMD Radeon RX 7900 XT · Q8_0 · 87 t/s
- AMD Radeon RX 7900 GRE · Q8_0 · 62.6 t/s
- AMD Radeon RX 6800 XT · Q8_0 · 55.7 t/s
- AMD Radeon PRO W7800 · BF16 · 31.3 t/s
- AMD Radeon PRO W7900 · FP32 · 23.5 t/s
- AMD Instinct MI300X · FP32 · 144 t/s
- AMD Radeon AI Pro 9700 32GB · BF16 · 34.8 t/s
- AMD Strix Halo (128GB) · FP32 · 7 t/s
- AMD Strix Halo (96GB) · FP32 · 7 t/s
- AMD Strix Halo (64GB) · FP32 · 7 t/s
- Apple M5 Max (128GB) · FP32 · 16.7 t/s
- Apple M5 Max (64GB) · FP32 · 16.7 t/s
- Apple M5 Max (48GB) · BF16 · 33.4 t/s
- Apple M5 Pro (48GB) · BF16 · 16.7 t/s
- Apple M5 Pro (36GB) · BF16 · 16.7 t/s
- Apple M5 Pro (24GB) · Q8_0 · 33.4 t/s
- Apple M5 (32GB) · BF16 · 8.3 t/s
- Apple M5 (16GB) · Q6_K · 20.3 t/s
- Apple M4 Ultra (384GB) · FP32 · 29.7 t/s
- Apple M4 Ultra (192GB) · FP32 · 29.7 t/s
- Apple M4 Max (128GB) · FP32 · 14.8 t/s
- Apple M4 Max (96GB) · FP32 · 14.8 t/s
- Apple M4 Max (64GB) · FP32 · 14.8 t/s
- Apple M4 Max (48GB) · BF16 · 29.7 t/s
- Apple M4 Pro (48GB) · BF16 · 14.8 t/s
- Apple M4 Pro (24GB) · Q8_0 · 29.7 t/s
- Apple M4 (32GB) · BF16 · 6.5 t/s
- Apple M4 (16GB) · Q6_K · 15.9 t/s
- Apple M3 Ultra (512GB) · FP32 · 22.3 t/s
- Apple M3 Ultra (256GB) · FP32 · 22.3 t/s
- Apple M3 Ultra (96GB) · FP32 · 22.3 t/s
- Apple M3 Max (128GB) · FP32 · 10.9 t/s
- Apple M3 Max (96GB) · FP32 · 10.9 t/s
- Apple M3 Max (64GB) · FP32 · 10.9 t/s
- Apple M3 Max (48GB) · BF16 · 21.7 t/s
- Apple M3 Max (36GB) · BF16 · 21.7 t/s
- Apple M3 Pro (36GB) · BF16 · 8.2 t/s
- Apple M3 Pro (18GB) · Q8_0 · 16.3 t/s
- Apple M3 (24GB) · Q8_0 · 10.9 t/s
- Apple M3 (16GB) · Q6_K · 13.3 t/s
- Apple M2 Ultra (384GB) · FP32 · 21.7 t/s
- Apple M2 Ultra (192GB) · FP32 · 21.7 t/s
- Apple M2 Max (96GB) · FP32 · 10.9 t/s
- Apple M2 Max (64GB) · FP32 · 10.9 t/s
- Apple M2 Max (32GB) · BF16 · 21.7 t/s
- Apple M2 Pro (32GB) · BF16 · 10.9 t/s
- Apple M2 Pro (16GB) · Q6_K · 26.5 t/s
- Apple M2 (24GB) · Q8_0 · 10.9 t/s
- Apple M2 (16GB) · Q6_K · 13.3 t/s
- Apple M1 Ultra (128GB) · FP32 · 21.7 t/s
- Apple M1 Ultra (64GB) · FP32 · 21.7 t/s
- Apple M1 Max (64GB) · FP32 · 10.9 t/s
- Apple M1 Max (32GB) · BF16 · 21.7 t/s
- Apple M1 Pro (32GB) · BF16 · 10.9 t/s
- Apple M1 Pro (16GB) · Q6_K · 26.5 t/s
- Apple M1 (16GB) · Q6_K · 9 t/s
- Intel Arc B580 12GB · Q5_K_M · 77 t/s
- Intel Arc B570 10GB · Q4_K_M · 73.4 t/s
- Intel Arc Pro B70 24GB · Q8_0 · 49.6 t/s
- Intel Arc Pro B60 24GB · Q8_0 · 41.3 t/s
- Intel Arc A770 16GB · Q8_0 · 60.9 t/s
- Intel Arc A770 8GB · Q3_K_M · 129.4 t/s
- Intel Arc A750 8GB · Q3_K_M · 129.4 t/s
- Intel Arc A580 8GB · Q3_K_M · 129.4 t/s
- Intel Arc Pro A60 12GB · Q5_K_M · 64.8 t/s
- Intel Data Center GPU Max 1550 · FP32 · 89 t/s
- Intel Data Center GPU Max 1100 · FP32 · 33.4 t/s
- Intel Arc 140V (32GB) · BF16 · 7.4 t/s
- Intel Arc 140V (16GB) · Q6_K · 18.2 t/s
- Intel Arc 130V (16GB) · Q6_K · 18.2 t/s
Plus 5 GPUs that run it with CPU offload (slower)
- Intel Arc A380 6GB · BF16 · 2.5 t/s
- Intel Arc A310 4GB · BF16 · 1.7 t/s
- Intel Arc Pro A50 6GB · BF16 · 2.6 t/s
- Intel Arc Pro A40 6GB · BF16 · 2.6 t/s
- CPU only (system RAM) · BF16 · 0.6 t/s
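Several of the listed speeds look consistent with a simple memory-bandwidth-bound estimate, where each generated token requires reading all weights once. This is an inference from the numbers, not a method the page describes. The bandwidth values below are assumed from public vendor specifications, and the function name is hypothetical.

```python
# Assumed peak memory bandwidth in GB/s, from public vendor specs (not from this page).
BANDWIDTH_GBPS = {
    "RTX 5090": 1792,
    "RTX 4090": 1008,
    "RTX 3090": 936,
}


def est_tokens_per_sec(bandwidth_gbps, weights_gb):
    """Upper-bound decode speed if every token reads all weights from memory once."""
    return bandwidth_gbps / weights_gb


# Weight sizes come from the quantization table: BF16 = 18.4 GB, NVFP4 = 4.6 GB.
print(round(est_tokens_per_sec(BANDWIDTH_GBPS["RTX 5090"], 18.4), 1))  # 97.4
print(round(est_tokens_per_sec(BANDWIDTH_GBPS["RTX 4090"], 4.6), 1))   # 219.1
print(round(est_tokens_per_sec(BANDWIDTH_GBPS["RTX 3090"], 4.6), 1))   # 203.5
```

If the listing is indeed derived this way, the t/s figures are theoretical ceilings rather than measurements. Real throughput is usually lower once compute, kernel efficiency, and KV cache reads are accounted for.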
Frequently asked questions
- What are the VRAM requirements for Gemma 2 9B Instruct?
- Gemma 2 9B Instruct requires approximately 9.0 GB of VRAM at Q4_K_M quantization, 13.5 GB at Q8_0, and 23.8 GB at FP16. These figures assume an 8k context window; the KV cache portion (2.82 GB at 8k) grows linearly with context length.
- How many parameters does Gemma 2 9B Instruct have?
- Gemma 2 9B Instruct has 9.2 billion parameters.
- How capable is Gemma 2 9B Instruct?
- Gemma 2 9B Instruct scores 32.0% on MMLU-Pro, making it well-suited for lightweight tasks, prototyping, and resource-constrained environments.
- Can Gemma 2 9B Instruct run on a 16 GB GPU?
- Yes. Gemma 2 9B Instruct needs 9.0 GB at Q4_K_M, which fits in a 16 GB GPU like the RTX 4080 or RTX 4070 Ti Super.
- What is the highest-quality format for Gemma 2 9B Instruct that fits in 24 GB of VRAM?
- BF16 (or FP16), which needs 23.8 GB at 8k context. That leaves only about 0.2 GB of headroom on a 24 GB card, so Q8_0 at 13.5 GB is the safer choice when anything else shares the GPU.
- What GPU do I need to run Gemma 2 9B Instruct locally?
- A 16 GB GPU is comfortably enough. At Q4_K_M, Gemma 2 9B Instruct needs 9.0 GB of VRAM at 8k context, and the 10 GB and 12 GB cards in the list above also run it. Good 16 GB options: RTX 4080, RTX 4070 Ti Super.
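The "what fits in N GB" questions above all reduce to a lookup against the Total column of the quantization table. The sketch below is one possible way to do that lookup, not the site's selection logic: it orders formats by footprint as a rough proxy for quality, and the function name and the `headroom_gb` parameter are hypothetical.

```python
# (format, total GB at 8k context, requires CUDA), copied from the table above
# and ordered from largest to smallest footprint.
QUANTS = [
    ("FP32", 44.4, False),
    ("BF16", 23.8, False),
    ("FP16", 23.8, False),
    ("Q8_0", 13.5, False),
    ("Q6_K", 11.6, False),
    ("Q5_K_M", 9.8, False),
    ("Q4_K_M", 9.0, False),
    ("NVFP4", 8.3, True),
    ("Q3_K_M", 7.6, False),
    ("Q2_K", 6.5, False),
]


def best_quant(vram_gb, cuda=False, headroom_gb=0.0):
    """Return the largest format whose 8k-context total fits, or None if nothing fits."""
    budget = vram_gb - headroom_gb
    for name, total_gb, needs_cuda in QUANTS:
        if needs_cuda and not cuda:
            continue  # skip CUDA-only formats on other hardware
        if total_gb <= budget:
            return name
    return None


print(best_quant(24))                   # BF16
print(best_quant(24, headroom_gb=1.0))  # Q8_0
print(best_quant(16))                   # Q8_0
print(best_quant(8))                    # Q3_K_M
```

With no headroom reserved, this picks larger formats than the GPU list does for some cards (for example, the list shows Q5_K_M on a 12 GB card where Q6_K nominally fits), so setting `headroom_gb` to a gigabyte or two gives more conservative results.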