Mistral Nemo 12B Instruct
Mistral Nemo 12B Instruct needs roughly 9.2 GB VRAM at Q4_K_M quantization (28.8 GB at FP16). 102 GPUs we track can run it fully in VRAM at 8k context.
102 GPUs run this natively · 5 with CPU offload
Mistral AI · 12.2B params · 128k context · Apache 2.0 · Commercial use ok
Mistral Nemo 12B Instruct is a 12.2B-parameter dense model developed by Mistral AI. Released in July 2024, it offers a 128K context window and multilingual support.
To run Mistral Nemo 12B Instruct locally, use Q5_K_M: about 7.9 GB of weights (10.3 GB total at 8k context), which fits on 12 GB GPUs.
It is Apache 2.0 licensed with strong multilingual capabilities, and offers a good balance of size and quality.
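As a rough check on the weight sizes quoted on this page, the footprint can be approximated as parameter count times bytes per weight. The sketch below assumes decimal gigabytes and a single nominal bit width per format; the function name `weight_size_gb` is illustrative. K-quants such as Q4_K_M mix bit widths across tensors, so their sizes do not follow from one nominal bit count and are not covered here.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in decimal GB: parameters x bytes per weight."""
    return params_billion * bits_per_weight / 8


PARAMS_B = 12.2  # parameter count stated on this page, in billions

# Nominal bit widths; Q8_0 is treated as exactly 8 bits, which matches this page's table.
for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("Q8_0", 8)]:
    print(f"{name}: {weight_size_gb(PARAMS_B, bits):.1f} GB")
```

These values line up with the Weights column for FP32, BF16/FP16, and Q8_0. The Total column is larger than weights plus KV cache, so it presumably includes runtime overhead that this estimate does not model.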
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 48.8 GB | 1.34 GB | 56.2 GB |
| BF16 | 24.4 GB | 1.34 GB | 28.8 GB |
| FP16 | 24.4 GB | 1.34 GB | 28.8 GB |
| Q8_0 | 12.2 GB | 1.34 GB | 15.2 GB |
| Q6_K | 10.0 GB | 1.34 GB | 12.7 GB |
| Q5_K_M (recommended) | 7.9 GB | 1.34 GB | 10.3 GB |
| Q4_K_M | 6.9 GB | 1.34 GB | 9.2 GB |
| Q3_K_M | 5.3 GB | 1.34 GB | 7.4 GB |
| Q2_K | 4.0 GB | 1.34 GB | 6.0 GB |
| NVFP4 (CUDA only) | 6.1 GB | 1.34 GB | 8.3 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
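To see how the KV cache figure scales with context, here is a minimal sketch of the usual estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The architecture defaults (40 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Mistral Nemo configuration and are not stated on this page; "8k" is assumed to mean 8192 tokens, and GB is decimal.

```python
def kv_cache_gb(
    context_tokens: int,
    n_layers: int = 40,        # assumed from the Mistral Nemo config, not from this page
    n_kv_heads: int = 8,       # assumed (grouped-query attention)
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # FP16 cache, as in the table above
) -> float:
    """Estimate KV cache size in decimal GB; it grows linearly with context length."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # keys + values
    return bytes_per_token * context_tokens / 1e9


for ctx in (8192, 32768, 131072):
    print(f"{ctx:>6} tokens: {kv_cache_gb(ctx):.2f} GB")
```

Under these assumptions the 8k result reproduces the 1.34 GB in the table, and the full 128K window would need roughly 16 times as much cache.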
Benchmarks
GPUs that run Mistral Nemo 12B Instruct natively (102)
- NVIDIA RTX 5090 · BF16 · 73.4 t/s
- NVIDIA RTX 5080 · NVFP4 · 157.4 t/s
- NVIDIA RTX 5070 Ti · NVFP4 · 146.9 t/s
- NVIDIA RTX 5070 · NVFP4 · 110.2 t/s
- NVIDIA RTX 5060 Ti 16GB · NVFP4 · 73.4 t/s
- NVIDIA RTX 5060 · Q3_K_M · 85.4 t/s
- NVIDIA RTX 5050 · Q3_K_M · 61 t/s
- NVIDIA RTX 4090 · NVFP4 · 165.2 t/s
- NVIDIA RTX 4080 · NVFP4 · 117.5 t/s
- NVIDIA RTX 4070 Ti · NVFP4 · 82.6 t/s
- NVIDIA RTX 4070 · NVFP4 · 82.6 t/s
- NVIDIA RTX 4060 Ti 16GB · NVFP4 · 47.2 t/s
- NVIDIA RTX 4060 · Q3_K_M · 51.8 t/s
- NVIDIA RTX 3090 · NVFP4 · 153.4 t/s
- NVIDIA RTX 3090 Ti · NVFP4 · 165.2 t/s
- NVIDIA RTX 3080 10GB · NVFP4 · 124.6 t/s
- NVIDIA RTX 3060 12GB · NVFP4 · 59 t/s
- NVIDIA H100 80GB · FP32 · 68.6 t/s
- NVIDIA A100 80GB · FP32 · 41.8 t/s
- NVIDIA A100 40GB · BF16 · 63.7 t/s
- NVIDIA L40S · BF16 · 35.4 t/s
- NVIDIA RTX A6000 · BF16 · 31.5 t/s
- NVIDIA RTX 4000 Ada · NVFP4 · 52.5 t/s
- NVIDIA RTX 4500 Ada · NVFP4 · 70.8 t/s
- NVIDIA RTX 5000 Ada · BF16 · 23.6 t/s
- NVIDIA RTX 6000 Ada · BF16 · 39.3 t/s
- NVIDIA RTX Pro 6000 · FP32 · 27.5 t/s
- NVIDIA DGX Spark (128GB) · FP32 · 5.6 t/s
- AMD Radeon RX 7900 XTX · Q8_0 · 78.7 t/s
- AMD Radeon RX 7900 XT · Q8_0 · 65.6 t/s
- AMD Radeon RX 7900 GRE · Q8_0 · 47.2 t/s
- AMD Radeon RX 6800 XT · Q8_0 · 42 t/s
- AMD Radeon PRO W7800 · BF16 · 23.6 t/s
- AMD Radeon PRO W7900 · BF16 · 35.4 t/s
- AMD Instinct MI300X · FP32 · 108.6 t/s
- AMD Radeon AI Pro 9700 32GB · BF16 · 26.2 t/s
- AMD Strix Halo (128GB) · FP32 · 5.2 t/s
- AMD Strix Halo (96GB) · FP32 · 5.2 t/s
- AMD Strix Halo (64GB) · FP32 · 5.2 t/s
- Apple M5 Max (128GB) · FP32 · 12.6 t/s
- Apple M5 Max (64GB) · FP32 · 12.6 t/s
- Apple M5 Max (48GB) · BF16 · 25.2 t/s
- Apple M5 Pro (48GB) · BF16 · 12.6 t/s
- Apple M5 Pro (36GB) · BF16 · 12.6 t/s
- Apple M5 Pro (24GB) · Q8_0 · 25.2 t/s
- Apple M5 (32GB) · Q8_0 · 12.5 t/s
- Apple M5 (16GB) · Q5_K_M · 19.5 t/s
- Apple M4 Ultra (384GB) · FP32 · 22.4 t/s
- Apple M4 Ultra (192GB) · FP32 · 22.4 t/s
- Apple M4 Max (128GB) · FP32 · 11.2 t/s
- Apple M4 Max (96GB) · FP32 · 11.2 t/s
- Apple M4 Max (64GB) · FP32 · 11.2 t/s
- Apple M4 Max (48GB) · BF16 · 22.4 t/s
- Apple M4 Pro (48GB) · BF16 · 11.2 t/s
- Apple M4 Pro (24GB) · Q8_0 · 22.4 t/s
- Apple M4 (32GB) · Q8_0 · 9.8 t/s
- Apple M4 (16GB) · Q5_K_M · 15.3 t/s
- Apple M3 Ultra (512GB) · FP32 · 16.8 t/s
- Apple M3 Ultra (256GB) · FP32 · 16.8 t/s
- Apple M3 Ultra (96GB) · FP32 · 16.8 t/s
- Apple M3 Max (128GB) · FP32 · 8.2 t/s
- Apple M3 Max (96GB) · FP32 · 8.2 t/s
- Apple M3 Max (64GB) · FP32 · 8.2 t/s
- Apple M3 Max (48GB) · BF16 · 16.4 t/s
- Apple M3 Max (36GB) · BF16 · 16.4 t/s
- Apple M3 Pro (36GB) · BF16 · 6.1 t/s
- Apple M3 Pro (18GB) · Q6_K · 15 t/s
- Apple M3 (24GB) · Q8_0 · 8.2 t/s
- Apple M3 (16GB) · Q5_K_M · 12.7 t/s
- Apple M3 (8GB) · Q2_K · 24.9 t/s
- Apple M2 Ultra (384GB) · FP32 · 16.4 t/s
- Apple M2 Ultra (192GB) · FP32 · 16.4 t/s
- Apple M2 Max (96GB) · FP32 · 8.2 t/s
- Apple M2 Max (64GB) · FP32 · 8.2 t/s
- Apple M2 Max (32GB) · Q8_0 · 32.8 t/s
- Apple M2 Pro (32GB) · Q8_0 · 16.4 t/s
- Apple M2 Pro (16GB) · Q5_K_M · 25.5 t/s
- Apple M2 (24GB) · Q8_0 · 8.2 t/s
- Apple M2 (16GB) · Q5_K_M · 12.7 t/s
- Apple M2 (8GB) · Q2_K · 24.9 t/s
- Apple M1 Ultra (128GB) · FP32 · 16.4 t/s
- Apple M1 Ultra (64GB) · FP32 · 16.4 t/s
- Apple M1 Max (64GB) · FP32 · 8.2 t/s
- Apple M1 Max (32GB) · Q8_0 · 32.8 t/s
- Apple M1 Pro (32GB) · Q8_0 · 16.4 t/s
- Apple M1 Pro (16GB) · Q5_K_M · 25.5 t/s
- Apple M1 (16GB) · Q5_K_M · 8.7 t/s
- Apple M1 (8GB) · Q2_K · 16.9 t/s
- Intel Arc B580 12GB · Q5_K_M · 58 t/s
- Intel Arc B570 10GB · Q4_K_M · 55.3 t/s
- Intel Arc Pro B70 24GB · Q8_0 · 37.4 t/s
- Intel Arc Pro B60 24GB · Q8_0 · 31.1 t/s
- Intel Arc A770 16GB · Q8_0 · 45.9 t/s
- Intel Arc A770 8GB · Q3_K_M · 97.6 t/s
- Intel Arc A750 8GB · Q3_K_M · 97.6 t/s
- Intel Arc A580 8GB · Q3_K_M · 97.6 t/s
- Intel Arc Pro A60 12GB · Q5_K_M · 48.9 t/s
- Intel Data Center GPU Max 1550 · FP32 · 67.1 t/s
- Intel Data Center GPU Max 1100 · BF16 · 50.4 t/s
- Intel Arc 140V (32GB) · Q8_0 · 11.2 t/s
- Intel Arc 140V (16GB) · Q5_K_M · 17.4 t/s
- Intel Arc 130V (16GB) · Q5_K_M · 17.4 t/s
Plus 5 GPUs that run it with CPU offload (slower)
- Intel Arc A380 6GB · BF16 · 1.9 t/s
- Intel Arc A310 4GB · BF16 · 1.3 t/s
- Intel Arc Pro A50 6GB · BF16 · 2 t/s
- Intel Arc Pro A40 6GB · BF16 · 2 t/s
- CPU only (system RAM) · Q8_0 · 0.9 t/s
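Several of the throughput figures above look consistent with a simple memory-bandwidth bound for token generation: each generated token reads the full set of weights once, so tokens per second is at most bandwidth divided by weight size. This is an inference from the numbers, not a method the page states, and the bandwidth values in the sketch (1008 GB/s for the RTX 4090, 1792 GB/s for the RTX 5090) are assumptions taken from vendor specifications rather than from this page.

```python
def decode_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper-bound decode speed if every token requires one full pass over the weights."""
    return bandwidth_gb_s / weights_gb


# (memory bandwidth in GB/s [assumed, not from this page], weight size in GB from the table)
examples = {
    "RTX 4090 at NVFP4": (1008.0, 6.1),
    "RTX 5090 at BF16": (1792.0, 24.4),
}

for label, (bandwidth, weights) in examples.items():
    print(f"{label}: {decode_tokens_per_second(bandwidth, weights):.1f} t/s")
```

Real-world speeds are typically lower than this bound because of compute limits, KV cache reads, and framework overhead, so treat the listed numbers as estimates rather than measurements.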
Frequently asked questions
- What are the VRAM requirements for Mistral Nemo 12B Instruct?
- Mistral Nemo 12B Instruct requires approximately 9.2 GB of VRAM at Q4_K_M quantization, 15.2 GB at Q8_0, and 28.8 GB at FP16. These numbers assume an 8k context window; the KV cache portion (1.34 GB at 8k) grows linearly with context length, so longer contexts need more VRAM.
- How many parameters does Mistral Nemo 12B Instruct have?
- Mistral Nemo 12B Instruct has 12.2 billion parameters.
- How capable is Mistral Nemo 12B Instruct?
- Mistral Nemo 12B Instruct has an MMLU-Pro score of 35.6, making it well-suited for lightweight tasks, prototyping, and resource-constrained environments.
- Can Mistral Nemo 12B Instruct run on a 16 GB GPU?
- Yes. Mistral Nemo 12B Instruct needs 9.2 GB at Q4_K_M, which fits in a 16 GB GPU like the RTX 4080 or RTX 4070 Ti Super.
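- What is the highest-quality quantization for Mistral Nemo 12B Instruct that fits in 24 GB of VRAM?
- At Q8_0, Mistral Nemo 12B Instruct needs 15.2 GB at 8k context, making it the highest-quality quantization that fits in 24 GB of VRAM. BF16 and FP16 need 28.8 GB and do not fit. Smaller options such as NVFP4 (8.3 GB, CUDA only) leave more room for longer contexts.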
- What GPU do I need to run Mistral Nemo 12B Instruct locally?
- A 16 GB GPU is enough. At Q4_K_M, Mistral Nemo 12B Instruct needs 9.2 GB VRAM. Good options: RTX 4080 (16 GB), RTX 4070 Ti Super (16 GB).
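The fit questions above all reduce to one lookup: walk the quantizations from highest to lowest precision and take the first whose total fits the available VRAM. The sketch below hard-codes the Total column from the table on this page (8k context). The function name `best_quant` is illustrative, and NVFP4 is left out because it is CUDA only and its quality ranking relative to the K-quants is not given here.

```python
from typing import Optional

# (quantization, total VRAM in GB at 8k context), ordered from highest to lowest precision.
TOTAL_VRAM_GB = [
    ("FP32", 56.2),
    ("BF16", 28.8),
    ("Q8_0", 15.2),
    ("Q6_K", 12.7),
    ("Q5_K_M", 10.3),
    ("Q4_K_M", 9.2),
    ("Q3_K_M", 7.4),
    ("Q2_K", 6.0),
]


def best_quant(vram_gb: float) -> Optional[str]:
    """Return the highest-precision quantization whose total fits, or None if none fit."""
    for name, total_gb in TOTAL_VRAM_GB:
        if total_gb <= vram_gb:
            return name
    return None


for vram in (8, 12, 16, 24):
    print(f"{vram} GB -> {best_quant(vram)}")
```

This ignores memory already in use by the display or other processes, so leaving some headroom below the card's nominal capacity is a sensible precaution.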