
Multimodal LLMs

14 models · local AI VRAM requirements & GPU compatibility

Multimodal models understand images, audio, or video in addition to text. For the language backbone, they need roughly the same VRAM as a text-only model with the same parameter count; the vision encoder adds extra memory on top during inference. Check each model page for the exact VRAM breakdown.
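As a rough illustration of how that breakdown works, here is a minimal back-of-envelope sketch. This is not the site's actual formula: the parameter counts, quantization factor, and overhead constant below are illustrative assumptions.

```python
def estimate_vram_gb(
    backbone_params_b: float,   # language backbone parameters, in billions
    vision_params_b: float,     # vision encoder parameters, in billions
    bytes_per_param: float,     # ~2.0 for FP16, ~0.5 for 4-bit quantization
    overhead_gb: float = 1.5,   # assumed lump sum for KV cache, activations, runtime
) -> float:
    """Rough VRAM estimate: weights for both components plus a fixed overhead."""
    backbone_gb = backbone_params_b * bytes_per_param
    vision_gb = vision_params_b * bytes_per_param  # assume same precision as backbone
    return backbone_gb + vision_gb + overhead_gb

# Example: a 7B multimodal model with a ~0.4B vision encoder, 4-bit quantized
print(f"{estimate_vram_gb(7.0, 0.4, 0.5):.1f} GB")  # ~5.2 GB
```

Real requirements also depend on context length and inference engine, which is why the per-model pages break the numbers down individually.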

Want to check your specific GPU? Use the homepage calculator to see which of these models fit your hardware with estimated tokens per second.