Multimodal LLMs
14 models · local AI VRAM requirements & GPU compatibility
Multimodal models understand images, audio, or video in addition to text. Running them locally requires the same VRAM as their text-only counterparts for the language backbone, plus additional memory for the vision encoder during inference. Check each model page for the exact VRAM breakdown.
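As a rough rule of thumb (an assumption for illustration, not this site's exact method), Q4_K_M quantization stores weights at roughly 4.5–5 bits per parameter on average, so a back-of-envelope VRAM estimate is parameter count times bits per weight, plus a small allowance for the vision encoder and runtime buffers:

```python
def estimate_vram_gb(params_billions, bits_per_weight=4.85, overhead_gb=1.0):
    """Rough VRAM estimate for a Q4_K_M-quantized model.

    bits_per_weight and overhead_gb are assumptions: Q4_K_M mixes 4- and
    6-bit blocks (about 4.5-5 bits/weight on average), and overhead_gb
    stands in for the vision encoder and runtime buffers.
    """
    return params_billions * bits_per_weight / 8 + overhead_gb

# A 27B dense model comes out within about 1 GB of the 16.8 GB
# figure listed below for Gemma 3 27B Instruct.
print(round(estimate_vram_gb(27), 1))
```

Per-model figures on this page will differ somewhat, since the exact bits-per-weight varies by architecture and tensor layout.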
| Model | Maker | Params | VRAM (Q4_K_M) | Fits |
|---|---|---|---|---|
| DeepSeek V4 Pro 1.6T | DeepSeek | 1600B (49B active) | 897.1 GB | |
| Kimi K2.6 | Moonshot AI | 1000B (32B active) | 563.0 GB | |
| Llama 4 Maverick 400B | Meta | 400B (17B active) | 228.5 GB | |
| DeepSeek V4 Flash 284B | DeepSeek | 284B (13B active) | 159.8 GB | |
| Llama 4 Scout 109B | Meta | 109B (17B active) | 64.0 GB | 80 GB |
| GLM-4.6V 106B | Z.ai | 106B (12B active) | 61.1 GB | 80 GB |
| Gemma 4 31B | Google | 31B | 21.0 GB | 24 GB |
| Gemma 3 27B Instruct | Google | 27B | 16.8 GB | 24 GB |
| Gemma 4 26B (MoE) | Google | 26B (3.8B active) | 16.1 GB | 24 GB |
| Mistral Small 3.1 24B Instruct | Mistral AI | 24B | 14.9 GB | 16 GB |
| Gemma 3 12B Instruct | Google | 12.2B | 8.0 GB | 12 GB |
| Gemma 3 4B Instruct | Google | 4B | 2.8 GB | 8 GB |
| Gemma 4 E4B | Google | 4B | 3.4 GB | 8 GB |
| Gemma 4 E2B | Google | 2B | 1.6 GB | 8 GB |
Want to check your specific GPU? Use the homepage calculator to see which of these models fit your hardware, along with estimated tokens per second for each.
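The fit check itself is simple to sketch. Below is a minimal version using a few of the Q4_K_M figures listed above; `headroom_gb` is an assumed reserve for KV cache and activations, not a value taken from this site:

```python
# Q4_K_M VRAM figures from the list above, in GB (subset for illustration).
MODELS = {
    "Gemma 3 27B Instruct": 16.8,
    "Mistral Small 3.1 24B Instruct": 14.9,
    "Gemma 3 12B Instruct": 8.0,
    "Gemma 3 4B Instruct": 2.8,
}

def models_that_fit(gpu_vram_gb, headroom_gb=1.5):
    # headroom_gb is an assumed allowance for KV cache and activations.
    return [name for name, need in MODELS.items()
            if need + headroom_gb <= gpu_vram_gb]

print(models_that_fit(12.0))
# → ['Gemma 3 12B Instruct', 'Gemma 3 4B Instruct']
```

With a 12 GB card, only the 12B and 4B models clear the budget, matching the "fits 12 GB" and "fits 8 GB" tags above.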