Small LLMs
9 models · local AI VRAM requirements & GPU compatibility
Small models (under ~10B parameters) are the sweet spot for consumer hardware. The smallest of them (roughly 3B parameters and below) run at full FP16 precision on an 8 GB GPU, avoiding quantization artifacts entirely, and every model in the table below fits in 8 GB at Q4_K_M (a rough sizing formula follows the table). They're fast enough for real-time chat and light enough for laptops and integrated graphics, making them ideal if you want always-on local AI without high power draw.
| Model | Developer | Parameters | Size (Q4_K_M) | VRAM |
|---|---|---|---|---|
| Gemma 3n E4B | Google | 4B | 3.4 GB | fits 8 GB |
| Phi-3.5 Mini Instruct | Microsoft | 3.8B | 5.7 GB | fits 8 GB |
| Llama 3.2 3B Instruct | Meta | 3.2B | 2.8 GB | fits 8 GB |
| Qwen 2.5 3B Instruct | Alibaba | 3.1B | 2.1 GB | fits 8 GB |
| Gemma 2 2B Instruct | Google | 2.6B | 2.4 GB | fits 8 GB |
| SmolLM2 1.7B Instruct | Hugging Face | 1.7B | 2.8 GB | fits 8 GB |
| Qwen 2.5 1.5B Instruct | Alibaba | 1.5B | 1.1 GB | fits 8 GB |
| Llama 3.2 1B Instruct | Meta | 1.24B | 1.0 GB | fits 8 GB |
| Gemma 3 1B Instruct | Google | 1B | 0.9 GB | fits 8 GB |
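
The sizing arithmetic behind these numbers is simple: a model's weight footprint is its parameter count times the bits per weight of the chosen quantization, plus headroom for the KV cache and runtime buffers. Here is a minimal Python sketch of that estimate; the effective bits-per-weight values, the flat 1 GB overhead, and the `estimate_vram_gb` helper are illustrative assumptions, not the site's actual calculator.

```python
# Rough VRAM estimate: weights footprint plus a flat allowance
# for KV cache and runtime buffers. Back-of-the-envelope only.

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # approximate effective bits/weight (assumption)
    "Q4_K_M": 4.85,  # approximate effective bits/weight (assumption)
}

def estimate_vram_gb(params_billions: float, quant: str,
                     overhead_gb: float = 1.0) -> float:
    """Weight bytes = params * bits_per_weight / 8; overhead is a guess."""
    weight_gb = params_billions * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9
    return weight_gb + overhead_gb

# Example: Llama 3.2 3B at Q4_K_M vs. full FP16 on an 8 GB card.
for quant in ("Q4_K_M", "FP16"):
    need = estimate_vram_gb(3.2, quant)
    print(f"{quant}: ~{need:.1f} GB -> {'fits' if need <= 8 else 'too big'} on 8 GB")
```

Note that FP16 for a ~3B model lands around 7.4 GB with this estimate, which is why only the smallest models in the table can skip quantization on an 8 GB GPU.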
Want to check your specific GPU? Use the homepage calculator to see which of these models fit your hardware, along with estimated tokens per second.
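
Decode speed on consumer GPUs is typically memory-bandwidth-bound: generating each token reads roughly the entire weight file, so tokens per second is on the order of memory bandwidth divided by model size. Below is a hedged sketch of that heuristic; the 60% bandwidth-efficiency factor and the `estimate_tokens_per_sec` helper are assumptions for illustration, not the homepage calculator's actual method (the bandwidth figures are published GPU specs).

```python
# First-order decode-speed estimate: single-token generation is usually
# memory-bandwidth-bound, so tokens/s ~= achieved bandwidth / model bytes.

GPU_BANDWIDTH_GBPS = {
    "RTX 3060 12GB": 360,
    "RTX 4060": 272,
    "RTX 4090": 1008,
}

def estimate_tokens_per_sec(model_size_gb: float, gpu: str,
                            efficiency: float = 0.6) -> float:
    """Assume ~60% of peak bandwidth is achieved in practice (assumption)."""
    return GPU_BANDWIDTH_GBPS[gpu] * efficiency / model_size_gb

# Example: Qwen 2.5 3B (2.1 GB at Q4_K_M) on an RTX 4060.
print(f"~{estimate_tokens_per_sec(2.1, 'RTX 4060'):.0f} tok/s")
```

With these assumptions, a 2.1 GB model on a 272 GB/s card works out to roughly 78 tok/s, which is why the Q4_K_M sizes in the table matter as much for speed as they do for fit.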