Gemma 2 2B Instruct
Gemma 2 2B Instruct needs roughly 2.6 GB VRAM at Q4_K_M quantization (6.8 GB at FP16). 106 GPUs we track can run it fully in VRAM at 8k context.
106 GPUs run this natively · 1 with CPU offload
Google · 2.6B params · 8k context · Gemma · Commercial use ok
Gemma 2 2B Instruct is a 2.6B-parameter dense model developed by Google, compact enough for edge deployment.
To run Gemma 2 2B Instruct locally: Q8_0 weights are about 2.6 GB (3.9 GB total at 8k context), small enough for phones and integrated graphics.
Surprisingly capable for its size: an MMLU-Pro score of 17.8% is strong at 2B scale.
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 10.4 GB | 0.87 GB | 12.6 GB |
| BF16 | 5.2 GB | 0.87 GB | 6.8 GB |
| FP16 | 5.2 GB | 0.87 GB | 6.8 GB |
| Q8_0 (recommended) | 2.6 GB | 0.87 GB | 3.9 GB |
| Q6_K | 2.1 GB | 0.87 GB | 3.4 GB |
| Q5_K_M | 1.7 GB | 0.87 GB | 2.9 GB |
| Q4_K_M | 1.5 GB | 0.87 GB | 2.6 GB |
| Q3_K_M | 1.1 GB | 0.87 GB | 2.2 GB |
| Q2_K | 0.9 GB | 0.87 GB | 1.9 GB |
| NVFP4 (CUDA only) | 1.3 GB | 0.87 GB | 2.4 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
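The Weights and KV cache columns can be reproduced with simple arithmetic. The sketch below is a rough illustration, not the calculator's actual code. It assumes the published Gemma 2 2B architecture values (26 layers, 4 KV heads, head dimension 256), which do not appear on this page, and it treats 8k as 8192 tokens and 1 GB as 10^9 bytes. It does not model the Total column, which is somewhat larger than weights plus KV cache, presumably because of runtime overhead.

```python
# Assumed architecture values for Gemma 2 2B (not stated on this page).
NUM_LAYERS = 26
NUM_KV_HEADS = 4
HEAD_DIM = 256


def kv_cache_gb(context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: keys and values for every layer and token."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens / 1e9


def weights_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight size in GB for a given parameter count and bit width."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    print(f"KV cache at 8k, FP16: {kv_cache_gb(8192):.2f} GB")
    print(f"KV cache at 16k, FP16: {kv_cache_gb(16384):.2f} GB")
    print(f"FP16 weights: {weights_gb(2.6e9, 16):.1f} GB")
    print(f"FP32 weights: {weights_gb(2.6e9, 32):.1f} GB")
```

Doubling the context doubles the KV cache while the weights stay fixed, which is why the KV cache column is identical across every row of the table.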
GPUs that run Gemma 2 2B Instruct natively (106)
- NVIDIA RTX 5090 · FP32 · 172.3 t/s
- NVIDIA RTX 5080 · FP32 · 92.3 t/s
- NVIDIA RTX 5070 Ti · FP32 · 86.2 t/s
- NVIDIA RTX 5070 · BF16 · 129.2 t/s
- NVIDIA RTX 5060 Ti 16GB · FP32 · 43.1 t/s
- NVIDIA RTX 5060 · BF16 · 86.2 t/s
- NVIDIA RTX 5050 · BF16 · 61.5 t/s
- NVIDIA RTX 4090 · FP32 · 96.9 t/s
- NVIDIA RTX 4080 · FP32 · 68.9 t/s
- NVIDIA RTX 4070 Ti · BF16 · 96.9 t/s
- NVIDIA RTX 4070 · BF16 · 96.9 t/s
- NVIDIA RTX 4060 Ti 16GB · FP32 · 27.7 t/s
- NVIDIA RTX 4060 · BF16 · 52.3 t/s
- NVIDIA RTX 3090 · FP32 · 90 t/s
- NVIDIA RTX 3090 Ti · FP32 · 96.9 t/s
- NVIDIA RTX 3080 10GB · BF16 · 146.2 t/s
- NVIDIA RTX 3060 12GB · BF16 · 69.2 t/s
- NVIDIA H100 80GB · FP32 · 322.1 t/s
- NVIDIA A100 80GB · FP32 · 196.1 t/s
- NVIDIA A100 40GB · FP32 · 149.5 t/s
- NVIDIA L40S · FP32 · 83.1 t/s
- NVIDIA RTX A6000 · FP32 · 73.8 t/s
- NVIDIA RTX 4000 Ada · FP32 · 30.8 t/s
- NVIDIA RTX 4500 Ada · FP32 · 41.5 t/s
- NVIDIA RTX 5000 Ada · FP32 · 55.4 t/s
- NVIDIA RTX 6000 Ada · FP32 · 92.3 t/s
- NVIDIA RTX Pro 6000 · FP32 · 129.2 t/s
- NVIDIA DGX Spark (128GB) · FP32 · 26.3 t/s
- AMD Radeon RX 7900 XTX · FP32 · 92.3 t/s
- AMD Radeon RX 7900 XT · FP32 · 76.9 t/s
- AMD Radeon RX 7900 GRE · FP32 · 55.4 t/s
- AMD Radeon RX 6800 XT · FP32 · 49.2 t/s
- AMD Radeon PRO W7800 · FP32 · 55.4 t/s
- AMD Radeon PRO W7900 · FP32 · 83.1 t/s
- AMD Instinct MI300X · FP32 · 509.6 t/s
- AMD Radeon AI Pro 9700 32GB · FP32 · 61.5 t/s
- AMD Strix Halo (128GB) · FP32 · 24.6 t/s
- AMD Strix Halo (96GB) · FP32 · 24.6 t/s
- AMD Strix Halo (64GB) · FP32 · 24.6 t/s
- Apple M5 Max (128GB) · FP32 · 59 t/s
- Apple M5 Max (64GB) · FP32 · 59 t/s
- Apple M5 Max (48GB) · FP32 · 59 t/s
- Apple M5 Pro (48GB) · FP32 · 29.5 t/s
- Apple M5 Pro (36GB) · FP32 · 29.5 t/s
- Apple M5 Pro (24GB) · FP32 · 29.5 t/s
- Apple M5 (32GB) · FP32 · 14.7 t/s
- Apple M5 (16GB) · BF16 · 29.4 t/s
- Apple M4 Ultra (384GB) · FP32 · 105 t/s
- Apple M4 Ultra (192GB) · FP32 · 105 t/s
- Apple M4 Max (128GB) · FP32 · 52.5 t/s
- Apple M4 Max (96GB) · FP32 · 52.5 t/s
- Apple M4 Max (64GB) · FP32 · 52.5 t/s
- Apple M4 Max (48GB) · FP32 · 52.5 t/s
- Apple M4 Pro (48GB) · FP32 · 26.3 t/s
- Apple M4 Pro (24GB) · FP32 · 26.3 t/s
- Apple M4 (32GB) · FP32 · 11.5 t/s
- Apple M4 (16GB) · BF16 · 23.1 t/s
- Apple M3 Ultra (512GB) · FP32 · 78.8 t/s
- Apple M3 Ultra (256GB) · FP32 · 78.8 t/s
- Apple M3 Ultra (96GB) · FP32 · 78.8 t/s
- Apple M3 Max (128GB) · FP32 · 38.5 t/s
- Apple M3 Max (96GB) · FP32 · 38.5 t/s
- Apple M3 Max (64GB) · FP32 · 38.5 t/s
- Apple M3 Max (48GB) · FP32 · 38.5 t/s
- Apple M3 Max (36GB) · FP32 · 38.5 t/s
- Apple M3 Pro (36GB) · FP32 · 14.4 t/s
- Apple M3 Pro (18GB) · FP32 · 14.4 t/s
- Apple M3 (24GB) · FP32 · 9.6 t/s
- Apple M3 (16GB) · BF16 · 19.2 t/s
- Apple M3 (8GB) · Q8_0 · 38.5 t/s
- Apple M2 Ultra (384GB) · FP32 · 76.9 t/s
- Apple M2 Ultra (192GB) · FP32 · 76.9 t/s
- Apple M2 Max (96GB) · FP32 · 38.5 t/s
- Apple M2 Max (64GB) · FP32 · 38.5 t/s
- Apple M2 Max (32GB) · FP32 · 38.5 t/s
- Apple M2 Pro (32GB) · FP32 · 19.2 t/s
- Apple M2 Pro (16GB) · BF16 · 38.5 t/s
- Apple M2 (24GB) · FP32 · 9.6 t/s
- Apple M2 (16GB) · BF16 · 19.2 t/s
- Apple M2 (8GB) · Q8_0 · 38.5 t/s
- Apple M1 Ultra (128GB) · FP32 · 76.9 t/s
- Apple M1 Ultra (64GB) · FP32 · 76.9 t/s
- Apple M1 Max (64GB) · FP32 · 38.5 t/s
- Apple M1 Max (32GB) · FP32 · 38.5 t/s
- Apple M1 Pro (32GB) · FP32 · 19.2 t/s
- Apple M1 Pro (16GB) · BF16 · 38.5 t/s
- Apple M1 (16GB) · BF16 · 13.1 t/s
- Apple M1 (8GB) · Q8_0 · 26.2 t/s
- Intel Arc B580 12GB · BF16 · 87.7 t/s
- Intel Arc B570 10GB · BF16 · 73.1 t/s
- Intel Arc Pro B70 24GB · FP32 · 43.8 t/s
- Intel Arc Pro B60 24GB · FP32 · 36.5 t/s
- Intel Arc A770 16GB · FP32 · 53.8 t/s
- Intel Arc A770 8GB · BF16 · 98.5 t/s
- Intel Arc A750 8GB · BF16 · 98.5 t/s
- Intel Arc A580 8GB · BF16 · 98.5 t/s
- Intel Arc A380 6GB · Q8_0 · 71.5 t/s
- Intel Arc A310 4GB · Q6_K · 58.2 t/s
- Intel Arc Pro A60 12GB · BF16 · 73.8 t/s
- Intel Arc Pro A50 6GB · Q8_0 · 73.8 t/s
- Intel Arc Pro A40 6GB · Q8_0 · 73.8 t/s
- Intel Data Center GPU Max 1550 · FP32 · 315 t/s
- Intel Data Center GPU Max 1100 · FP32 · 118.2 t/s
- Intel Arc 140V (32GB) · FP32 · 13.2 t/s
- Intel Arc 140V (16GB) · BF16 · 26.3 t/s
- Intel Arc 130V (16GB) · BF16 · 26.3 t/s
Plus 1 configuration that runs it with CPU offload (slower)
- CPU only (system RAM) · FP32 · 1 t/s
Frequently asked questions
- What are the VRAM requirements for Gemma 2 2B Instruct?
- Gemma 2 2B Instruct requires approximately 2.6 GB of VRAM at Q4_K_M quantization, 3.9 GB at Q8_0, and 6.8 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length, so total VRAM rises with longer contexts.
- How many parameters does Gemma 2 2B Instruct have?
- Gemma 2 2B Instruct has 2.6 billion parameters.
- How capable is Gemma 2 2B Instruct?
- Gemma 2 2B Instruct has an MMLU-Pro score of 17.8%, making it well-suited for lightweight tasks, prototyping, and resource-constrained environments.
- Can Gemma 2 2B Instruct run on a 16 GB GPU?
- Yes. Gemma 2 2B Instruct needs 2.6 GB at Q4_K_M, which fits in a 16 GB GPU like the RTX 4080 or RTX 4070 Ti Super.
- What is the highest-quality precision for Gemma 2 2B Instruct that fits in 24 GB of VRAM?
- FP32. At full FP32 precision, Gemma 2 2B Instruct needs 12.6 GB, so even the unquantized model fits in 24 GB of VRAM.
- What GPU do I need to run Gemma 2 2B Instruct locally?
- A 16 GB GPU is more than enough. At Q4_K_M, Gemma 2 2B Instruct needs 2.6 GB of VRAM, and cards with as little as 4 GB appear in the list above. Good options: RTX 4080 (16 GB), RTX 4070 Ti Super (16 GB).
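The "which precision fits my GPU" questions above come down to a lookup against the Total column of the VRAM table. The following is a minimal sketch of that lookup, not the site's actual selection logic. The `usable_fraction` parameter is a hypothetical headroom factor introduced here for illustration; the page does not state how much memory it reserves. NVFP4 is left out because it requires a CUDA GPU.

```python
from typing import Optional

# Total VRAM in GB at 8k context, copied from the table on this page,
# ordered from highest to lowest precision.
TOTAL_VRAM_GB = [
    ("FP32", 12.6),
    ("BF16", 6.8),
    ("FP16", 6.8),
    ("Q8_0", 3.9),
    ("Q6_K", 3.4),
    ("Q5_K_M", 2.9),
    ("Q4_K_M", 2.6),
    ("Q3_K_M", 2.2),
    ("Q2_K", 1.9),
]


def best_fit(vram_gb: float, usable_fraction: float = 0.9) -> Optional[str]:
    """Return the highest-precision format whose total fits the budget.

    usable_fraction is an assumed headroom factor (hypothetical), leaving
    part of the memory free for the driver and other processes.
    """
    budget = vram_gb * usable_fraction
    for name, total_gb in TOTAL_VRAM_GB:
        if total_gb <= budget:
            return name
    return None  # nothing fits fully in VRAM


if __name__ == "__main__":
    for size in (24, 16, 8, 6, 4, 2):
        print(f"{size} GB -> {best_fit(size)}")
```

With the assumed 10% headroom, a 24 GB card resolves to FP32 and a 4 GB card to Q6_K; a different headroom value would shift the cutoffs.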