
Multimodal LLMs

14 models · local AI VRAM requirements & GPU compatibility

Multimodal models understand images, audio, or video in addition to text. For the language backbone, they need roughly the same VRAM as a text-only model with the same parameter count; the vision encoder adds extra memory on top during inference. Check each model page for the exact VRAM breakdown.
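As a rough illustration of how that breakdown works, here is a minimal back-of-envelope sketch. This is not the site's actual formula: the parameter counts, quantization factor, and overhead constant below are illustrative assumptions.

```python
def estimate_vram_gb(
    backbone_params_b: float,   # language backbone parameters, in billions
    vision_params_b: float,     # vision encoder parameters, in billions
    bytes_per_param: float,     # ~2.0 for FP16, ~0.5 for 4-bit quantization
    overhead_gb: float = 1.5,   # assumed lump sum for KV cache, activations, runtime
) -> float:
    """Rough VRAM estimate: weights for both components plus a fixed overhead."""
    backbone_gb = backbone_params_b * bytes_per_param
    vision_gb = vision_params_b * bytes_per_param  # assume same precision as backbone
    return backbone_gb + vision_gb + overhead_gb

# Example: a 7B multimodal model with a ~0.4B vision encoder, 4-bit quantized
print(f"{estimate_vram_gb(7.0, 0.4, 0.5):.1f} GB")  # ~5.2 GB
```

Real requirements also depend on context length and inference engine, which is why the per-model pages break the numbers down individually.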

Want to check your specific GPU? Use the homepage calculator to see which of these models fit your hardware with estimated tokens per second.