SmolLM2 360M Instruct
SmolLM2 360M Instruct needs roughly 0.6 GB VRAM at Q4_K_M quantization (1.2 GB at FP16). 106 GPUs we track can run it fully in VRAM at 8k context.
106 GPUs run this natively · 1 with CPU offload
Hugging Face · 0.36B params · 8k context · Apache 2.0 · Commercial use ok
SmolLM2 360M Instruct is a 0.36B parameter dense model developed by Hugging Face. It is an ultra-compact model aimed at mobile and embedded use.
To run SmolLM2 360M Instruct locally, use Q8_0 (about 500 MB), which runs on virtually any device.
At 360M parameters, it is among the smallest practical LLMs for on-device use.
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 1.4 GB | 0.34 GB | 2.0 GB |
| BF16 | 0.7 GB | 0.34 GB | 1.2 GB |
| FP16 | 0.7 GB | 0.34 GB | 1.2 GB |
| Q8_0 (recommended) | 0.4 GB | 0.34 GB | 0.8 GB |
| Q6_K | 0.3 GB | 0.34 GB | 0.7 GB |
| Q5_K_M | 0.2 GB | 0.34 GB | 0.6 GB |
| Q4_K_M | 0.2 GB | 0.34 GB | 0.6 GB |
| Q3_K_M | 0.1 GB | 0.34 GB | 0.6 GB |
| Q2_K | 0.1 GB | 0.34 GB | 0.5 GB |
| NVFP4 (CUDA) | 0.2 GB | 0.34 GB | 0.6 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
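The weight and KV cache columns can be approximated from first principles. The sketch below is a rough estimate, not the calculator's actual method. The layer count, KV head count, and head dimension are assumptions not stated on this page (check the model's `config.json`), and the bits-per-weight values for the quantized formats are approximate. The Total column in the table is larger than weights plus KV cache, presumably because it includes runtime overhead that this sketch does not model.

```python
# Back-of-the-envelope VRAM estimate for a dense transformer.
# Architecture values below are ASSUMPTIONS, not taken from this page.
N_PARAMS = 0.36e9   # from the page: 0.36B parameters
N_LAYERS = 32       # assumed number of transformer layers
N_KV_HEADS = 5      # assumed number of key/value heads (grouped-query attention)
HEAD_DIM = 64       # assumed dimension of each attention head

# Approximate bits per weight; the quantized values are rough assumptions.
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weights_gb(quant: str) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return N_PARAMS * BITS_PER_WEIGHT[quant] / 8 / 1e9


def kv_cache_gb(context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GB; grows linearly with context length."""
    # Two tensors (K and V) per layer, one vector per KV head per token.
    per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


if __name__ == "__main__":
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: weights ~{weights_gb(quant):.2f} GB")
    for ctx in (2048, 8192, 16384):
        print(f"{ctx} tokens: KV cache ~{kv_cache_gb(ctx):.2f} GB (FP16)")
```

Under these assumptions, an 8192-token context gives a KV cache of about 0.34 GB, in line with the table, and doubling the context doubles the cache.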
Benchmarks
GPUs that run SmolLM2 360M Instruct natively (106)
- NVIDIA RTX 5090 · FP32 · 1244.4 t/s
- NVIDIA RTX 5080 · FP32 · 666.7 t/s
- NVIDIA RTX 5070 Ti · FP32 · 622.2 t/s
- NVIDIA RTX 5070 · FP32 · 466.7 t/s
- NVIDIA RTX 5060 Ti 16GB · FP32 · 311.1 t/s
- NVIDIA RTX 5060 · FP32 · 311.1 t/s
- NVIDIA RTX 5050 · FP32 · 222.2 t/s
- NVIDIA RTX 4090 · FP32 · 700 t/s
- NVIDIA RTX 4080 · FP32 · 497.9 t/s
- NVIDIA RTX 4070 Ti · FP32 · 350 t/s
- NVIDIA RTX 4070 · FP32 · 350 t/s
- NVIDIA RTX 4060 Ti 16GB · FP32 · 200 t/s
- NVIDIA RTX 4060 · FP32 · 188.9 t/s
- NVIDIA RTX 3090 · FP32 · 650 t/s
- NVIDIA RTX 3090 Ti · FP32 · 700 t/s
- NVIDIA RTX 3080 10GB · FP32 · 527.8 t/s
- NVIDIA RTX 3060 12GB · FP32 · 250 t/s
- NVIDIA H100 80GB · FP32 · 2326.4 t/s
- NVIDIA A100 80GB · FP32 · 1416 t/s
- NVIDIA A100 40GB · FP32 · 1079.9 t/s
- NVIDIA L40S · FP32 · 600 t/s
- NVIDIA RTX A6000 · FP32 · 533.3 t/s
- NVIDIA RTX 4000 Ada · FP32 · 222.2 t/s
- NVIDIA RTX 4500 Ada · FP32 · 300 t/s
- NVIDIA RTX 5000 Ada · FP32 · 400 t/s
- NVIDIA RTX 6000 Ada · FP32 · 666.7 t/s
- NVIDIA RTX Pro 6000 · FP32 · 933.3 t/s
- NVIDIA DGX Spark (128GB) · FP32 · 189.6 t/s
- AMD Radeon RX 7900 XTX · FP32 · 666.7 t/s
- AMD Radeon RX 7900 XT · FP32 · 555.6 t/s
- AMD Radeon RX 7900 GRE · FP32 · 400 t/s
- AMD Radeon RX 6800 XT · FP32 · 355.6 t/s
- AMD Radeon PRO W7800 · FP32 · 400 t/s
- AMD Radeon PRO W7900 · FP32 · 600 t/s
- AMD Instinct MI300X · FP32 · 3680.6 t/s
- AMD Radeon AI Pro 9700 32GB · FP32 · 444.4 t/s
- AMD Strix Halo (128GB) · FP32 · 177.8 t/s
- AMD Strix Halo (96GB) · FP32 · 177.8 t/s
- AMD Strix Halo (64GB) · FP32 · 177.8 t/s
- Apple M5 Max (128GB) · FP32 · 426.4 t/s
- Apple M5 Max (64GB) · FP32 · 426.4 t/s
- Apple M5 Max (48GB) · FP32 · 426.4 t/s
- Apple M5 Pro (48GB) · FP32 · 213.2 t/s
- Apple M5 Pro (36GB) · FP32 · 213.2 t/s
- Apple M5 Pro (24GB) · FP32 · 213.2 t/s
- Apple M5 (32GB) · FP32 · 106.3 t/s
- Apple M5 (16GB) · FP32 · 106.3 t/s
- Apple M4 Ultra (384GB) · FP32 · 758.3 t/s
- Apple M4 Ultra (192GB) · FP32 · 758.3 t/s
- Apple M4 Max (128GB) · FP32 · 379.2 t/s
- Apple M4 Max (96GB) · FP32 · 379.2 t/s
- Apple M4 Max (64GB) · FP32 · 379.2 t/s
- Apple M4 Max (48GB) · FP32 · 379.2 t/s
- Apple M4 Pro (48GB) · FP32 · 189.6 t/s
- Apple M4 Pro (24GB) · FP32 · 189.6 t/s
- Apple M4 (32GB) · FP32 · 83.3 t/s
- Apple M4 (16GB) · FP32 · 83.3 t/s
- Apple M3 Ultra (512GB) · FP32 · 568.8 t/s
- Apple M3 Ultra (256GB) · FP32 · 568.8 t/s
- Apple M3 Ultra (96GB) · FP32 · 568.8 t/s
- Apple M3 Max (128GB) · FP32 · 277.8 t/s
- Apple M3 Max (96GB) · FP32 · 277.8 t/s
- Apple M3 Max (64GB) · FP32 · 277.8 t/s
- Apple M3 Max (48GB) · FP32 · 277.8 t/s
- Apple M3 Max (36GB) · FP32 · 277.8 t/s
- Apple M3 Pro (36GB) · FP32 · 104.2 t/s
- Apple M3 Pro (18GB) · FP32 · 104.2 t/s
- Apple M3 (24GB) · FP32 · 69.4 t/s
- Apple M3 (16GB) · FP32 · 69.4 t/s
- Apple M3 (8GB) · FP32 · 69.4 t/s
- Apple M2 Ultra (384GB) · FP32 · 555.6 t/s
- Apple M2 Ultra (192GB) · FP32 · 555.6 t/s
- Apple M2 Max (96GB) · FP32 · 277.8 t/s
- Apple M2 Max (64GB) · FP32 · 277.8 t/s
- Apple M2 Max (32GB) · FP32 · 277.8 t/s
- Apple M2 Pro (32GB) · FP32 · 138.9 t/s
- Apple M2 Pro (16GB) · FP32 · 138.9 t/s
- Apple M2 (24GB) · FP32 · 69.4 t/s
- Apple M2 (16GB) · FP32 · 69.4 t/s
- Apple M2 (8GB) · FP32 · 69.4 t/s
- Apple M1 Ultra (128GB) · FP32 · 555.6 t/s
- Apple M1 Ultra (64GB) · FP32 · 555.6 t/s
- Apple M1 Max (64GB) · FP32 · 277.8 t/s
- Apple M1 Max (32GB) · FP32 · 277.8 t/s
- Apple M1 Pro (32GB) · FP32 · 138.9 t/s
- Apple M1 Pro (16GB) · FP32 · 138.9 t/s
- Apple M1 (16GB) · FP32 · 47.2 t/s
- Apple M1 (8GB) · FP32 · 47.2 t/s
- Intel Arc B580 12GB · FP32 · 316.7 t/s
- Intel Arc B570 10GB · FP32 · 263.9 t/s
- Intel Arc Pro B70 24GB · FP32 · 316.7 t/s
- Intel Arc Pro B60 24GB · FP32 · 263.9 t/s
- Intel Arc A770 16GB · FP32 · 388.9 t/s
- Intel Arc A770 8GB · FP32 · 355.6 t/s
- Intel Arc A750 8GB · FP32 · 355.6 t/s
- Intel Arc A580 8GB · FP32 · 355.6 t/s
- Intel Arc A380 6GB · FP32 · 129.2 t/s
- Intel Arc A310 4GB · FP32 · 86.1 t/s
- Intel Arc Pro A60 12GB · FP32 · 266.7 t/s
- Intel Arc Pro A50 6GB · FP32 · 133.3 t/s
- Intel Arc Pro A40 6GB · FP32 · 133.3 t/s
- Intel Data Center GPU Max 1550 · FP32 · 2275 t/s
- Intel Data Center GPU Max 1100 · FP32 · 853.5 t/s
- Intel Arc 140V (32GB) · FP32 · 95.1 t/s
- Intel Arc 140V (16GB) · FP32 · 95.1 t/s
- Intel Arc 130V (16GB) · FP32 · 95.1 t/s
Plus 1 configuration that runs it with CPU offload (slower)
- CPU only (system RAM) · FP32 · 7.3 t/s
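The page does not say how these tokens-per-second figures were obtained. Several of them are consistent with a simple memory-bandwidth bound for single-stream decoding, where each generated token requires reading all weights once. The sketch below illustrates that rule of thumb only. The bandwidth values are vendor spec figures supplied here as assumptions, and the 1.44 GB weight size is 0.36B parameters at 4 bytes each.

```python
# Rule-of-thumb decode speed: tokens/s <= memory bandwidth / bytes read per token.
# For a dense model, roughly all weights are read once per generated token.
WEIGHTS_GB_FP32 = 0.36e9 * 4 / 1e9  # 1.44 GB at FP32

# Vendor-spec memory bandwidth in GB/s (assumed values, not from this page).
SPEC_BANDWIDTH_GB_S = {
    "NVIDIA RTX 5090": 1792,
    "NVIDIA RTX 4090": 1008,
}


def decode_tps_upper_bound(bandwidth_gb_s: float,
                           weights_gb: float = WEIGHTS_GB_FP32) -> float:
    """Bandwidth-bound estimate of tokens per second; ignores compute and KV reads."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    for gpu, bandwidth in SPEC_BANDWIDTH_GB_S.items():
        print(f"{gpu}: ~{decode_tps_upper_bound(bandwidth):.1f} t/s")
```

Real throughput depends on the runtime, batch size, and kernel efficiency, and a model this small is often limited by launch overhead before it reaches the bandwidth bound, so treat such figures as optimistic estimates.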
Frequently asked questions
- What are the VRAM requirements for SmolLM2 360M Instruct?
- SmolLM2 360M Instruct requires approximately 0.6 GB of VRAM at Q4_K_M quantization, 0.8 GB at Q8_0, and 1.2 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length.
- How many parameters does SmolLM2 360M Instruct have?
- SmolLM2 360M Instruct has 0.36 billion parameters.
- How capable is SmolLM2 360M Instruct?
- SmolLM2 360M Instruct has an MMLU-Pro score of 8, which is low, so it is best suited to lightweight tasks, prototyping, and resource-constrained environments.
- Can SmolLM2 360M Instruct run on a 16 GB GPU?
- Yes, with room to spare. SmolLM2 360M Instruct needs 0.6 GB at Q4_K_M and 2.0 GB at FP32, so it fits easily in a 16 GB GPU like the RTX 4080 or RTX 4070 Ti Super.
- What is the highest-quality format for SmolLM2 360M Instruct that fits in 24 GB of VRAM?
- FP32, the highest-precision option, needs only 2.0 GB, so every format in the table above fits in 24 GB of VRAM.
- What GPU do I need to run SmolLM2 360M Instruct locally?
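- Almost any GPU works. At Q4_K_M, SmolLM2 360M Instruct needs 0.6 GB VRAM, and even FP32 needs only 2.0 GB, which fits on the 4 GB Intel Arc A310 listed above. A 16 GB card such as the RTX 4080 or RTX 4070 Ti Super is far more than enough.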
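The "what fits in N GB" questions above reduce to a lookup against the Total column of the VRAM table. A minimal sketch of that lookup follows; the function name and the `headroom_gb` parameter are hypothetical, and the totals are copied from the table at 8k context.

```python
from typing import Optional

# Total VRAM in GB at 8k context, copied from the table, highest precision first.
TOTAL_VRAM_GB = [
    ("FP32", 2.0),
    ("FP16", 1.2),
    ("Q8_0", 0.8),
    ("Q6_K", 0.7),
    ("Q5_K_M", 0.6),
    ("Q4_K_M", 0.6),
    ("Q3_K_M", 0.6),
    ("Q2_K", 0.5),
]


def best_format_for(budget_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the highest-precision format whose total fits the budget, or None."""
    for name, total_gb in TOTAL_VRAM_GB:
        if total_gb + headroom_gb <= budget_gb:
            return name
    return None


if __name__ == "__main__":
    for budget in (24, 16, 1.0, 0.4):
        print(f"{budget} GB -> {best_format_for(budget)}")
```

Setting `headroom_gb` reserves space for other processes or a longer context than the 8k assumed by the table.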