Run LLMs on Your Own Hardware

Name: CanItRun
Author: CanItRun

Find which open-weight models fit your GPU, learn how to set them up, and discover the best tools for local AI — all in one place.

1. Start here if you own a GPU

I have a GPU — what can I run? ↓

See which open-weight LLMs fit your GPU — with quantization levels, benchmarks, and tokens/sec estimates.

2. Buying or upgrading

I want to choose a GPU →

Compare VRAM, bandwidth, and real-world performance across GPUs and Apple Silicon.

3. Brand new to all of this

I'm new to local LLMs →

Quantization, Ollama, VRAM vs RAM — everything to run your first model in 10 minutes.

100+ GPUs tracked · 50+ models benchmarked · 25+ in-depth guides · 20+ AI apps catalogued

Already have a GPU? Pick it below to see which models fit — at which quantization, with benchmarks and estimated performance.

Your hardware

GPUs

24 GB VRAM · 1008 GB/s · CUDA

System RAM 32 GB

Used for CPU offload when models don't fit in VRAM.

Context length 8,192 tokens

Larger context = larger KV cache.

Minimum quantization

Only show models that fit at this quality or higher.

TurboQuant KV Cache (experimental)

Reduces KV cache memory ~4.6× via 3.5-bit compression.
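To see where a figure like ~4.6× comes from, here is a rough sketch of KV cache sizing. It assumes a plain 16-bit cache as the baseline and ignores any per-block metadata a real 3.5-bit scheme would store; the function name and the example model shape (32 layers, 8 KV heads, head dimension 128, as in Llama 3.1 8B-style models) are illustrative assumptions, not CanItRun internals.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bits_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 covers one key vector and one value vector per layer, head, and token.
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bits_per_value / 8


# Assumed model shape: 32 layers, 8 KV heads, head dimension 128.
baseline = kv_cache_bytes(32, 8, 128, 8192, bits_per_value=16)
compressed = kv_cache_bytes(32, 8, 128, 8192, bits_per_value=3.5)

print(f"16-bit cache at 8,192 tokens:  {baseline / 2**30:.2f} GiB")
print(f"3.5-bit cache at 8,192 tokens: {compressed / 2**30:.2f} GiB")
print(f"Reduction: {baseline / compressed:.2f}x")
```

The cache grows linearly with context length, so doubling the context doubles both numbers while the ratio (16 / 3.5, about 4.57) stays the same.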

Fits fully: 44

With offload: 7

Won't run: 29

GLM-4.5 Air 106B
Z.ai · 106B (12B active) · reasoning
Q2_K · 81.4
GLM-4.6V 106B
Z.ai · 106B (12B active)
Q2_K · 79.9
Qwen 2.5 72B Instruct
Alibaba · 72B
NVFP4 · 71.1

Columns: verdict · model · best quant · headline benchmark · est. tokens/sec. Hover the score for its benchmark. Click a row for the full breakdown.

LLM Compatibility for NVIDIA RTX 4090 (24 GB VRAM)

This table shows which open-weight LLMs fit on an NVIDIA RTX 4090 at 4,096 tokens of context. The interactive calculator above provides per-quantization detail and lets you switch GPUs — this static view is for search engines and readers without JavaScript.

Model	Vendor	Params	Best Quant	VRAM Needed	Est. tok/s	Status
DeepSeek V4 Pro 1.6T	DeepSeek	1600B	—	—	—	No
MiMo V2.5 Pro	Xiaomi	1020B	—	—	—	No
Kimi K2.6	Moonshot AI	1000B	—	—	—	No
Kimi K2.5	Moonshot AI	1000B	—	—	—	No
Inkling	Thinking Machines Lab	975B	—	—	—	No
GLM-5.1 754B	Z.ai	754B	—	—	—	No
GLM-5.2 753B	Z.ai	753B	—	—	—	No
GLM-5 744B	Z.ai	744B	—	—	—	No
DeepSeek V3 671B	DeepSeek	671B	—	—	—	No
DeepSeek R1 671B	DeepSeek	671B	—	—	—	No
Nemotron 3 Ultra 550B-A55B	NVIDIA	550B	—	—	—	No
MiniMax M1 456B	MiniMax	456B	—	—	—	No
MiniMax M3	MiniMax	428B	—	—	—	No
Llama 3.1 405B Instruct	Meta	405B	—	—	—	No
Llama 4 Maverick 400B	Meta	400B	—	—	—	No
GLM-4.7 358B	Z.ai	358B	—	—	—	No
GLM-4.5 355B	Z.ai	355B	—	—	—	No
GLM-4.6 355B	Z.ai	355B	—	—	—	No
DeepSeek V4 Flash 284B	DeepSeek	284B	—	—	—	No
Qwen3 235B-A22B (MoE)	Alibaba	235B	—	—	—	No
MiniMax M2.5 229B	MiniMax	229B	—	—	—	No
MiniMax M2.7 229B	MiniMax	229B	—	—	—	No
Step 3.7 Flash	StepFun	198B	—	—	—	No
Step 3.5 Flash	StepFun	196.81B	—	—	—	No
Mixtral 8x22B Instruct v0.1	Mistral AI	141B	—	—	—	No
Qwen 3.5 122B-A10B (MoE)	Alibaba	122B	—	—	—	No
Nemotron 3 Super 120B	NVIDIA	120B	—	—	—	No
GPT-OSS 120B	OpenAI	117B	—	—	—	No
Llama 4 Scout 109B	Meta	109B	Q2_K	48.0 GB	2	Offload
GLM-4.5 Air 106B	Z.ai	106B	Q2_K	46.1 GB	3	Offload
GLM-4.6V 106B	Z.ai	106B	Q2_K	46.1 GB	3	Offload
Qwen 2.5 72B Instruct	Alibaba	72B	NVFP4	41.8 GB	2	Offload
Llama 3.3 70B Instruct	Meta	70B	NVFP4	40.7 GB	2	Offload
DeepSeek R1 Distill Llama 70B	DeepSeek	70B	NVFP4	40.7 GB	2	Offload
Llama 3.1 70B Instruct	Meta	70B	NVFP4	40.7 GB	2	Offload
Mixtral 8x7B Instruct v0.1	Mistral AI	46.7B	Q2_K	20.5 GB	39	Fits
Command-R 35B	Cohere	35B	Q2_K	20.9 GB	35	Fits
Qwen 3.5 35B-A3B (MoE)	Alibaba	35B	NVFP4	20.1 GB	121	Fits
Qwen 3.6 35B	Alibaba	35B	NVFP4	20.8 GB	35	Fits
Yi 1.5 34B Chat	01.AI	34.4B	NVFP4	20.4 GB	36	Fits
Qwen3 32B	Alibaba	32.8B	NVFP4	19.1 GB	38	Fits
Qwen 2.5 32B Instruct	Alibaba	32.5B	NVFP4	19.4 GB	38	Fits
Qwen 2.5 Coder 32B Instruct	Alibaba	32.5B	NVFP4	19.4 GB	38	Fits
DeepSeek R1 Distill Qwen 32B	DeepSeek	32.5B	NVFP4	19.4 GB	38	Fits
Nemotron 3 Nano 30B	NVIDIA	32B	NVFP4	18.2 GB	126	Fits
Gemma 4 31B	Google	31B	NVFP4	19.2 GB	38	Fits
Qwen3 30B-A3B (MoE)	Alibaba	30B	NVFP4	17.3 GB	121	Fits
Gemma 2 27B Instruct	Google	27.2B	NVFP4	17.0 GB	43	Fits
Gemma 3 27B Instruct	Google	27B	NVFP4	16.0 GB	46	Fits
Qwen 3.6 27B	Alibaba	27B	NVFP4	16.0 GB	46	Fits
Bonsai 27B	PrismML	27B	Ternary (Q2_0)	9.0 GB	82	Fits
Gemma 4 26B (MoE)	Google	26B	NVFP4	15.3 GB	93	Fits
Mistral Small 3.1 24B Instruct	Mistral AI	24B	NVFP4	14.2 GB	52	Fits
Mistral Small 22B	Mistral AI	22.2B	NVFP4	13.5 GB	54	Fits
GPT-OSS 20B	OpenAI	21B	NVFP4	12.0 GB	95	Fits
Qwen3 14B	Alibaba	14.8B	NVFP4	9.0 GB	81	Fits
Qwen 2.5 14B Instruct	Alibaba	14.7B	NVFP4	9.1 GB	80	Fits
Phi-4 14B Instruct	Microsoft	14B	NVFP4	8.6 GB	85	Fits
Mistral Nemo 12B Instruct	Mistral AI	12.2B	NVFP4	7.6 GB	97	Fits
Gemma 3 12B Instruct	Google	12.2B	NVFP4	7.4 GB	99	Fits
Gemma 4 12B (Unified)	Google	12B	NVFP4	8.5 GB	86	Fits
Gemma 2 9B Instruct	Google	9.2B	BF16	22.2 GB	33	Fits
Llama 3.1 8B Instruct	Meta	8B	BF16	18.5 GB	40	Fits
DeepSeek R1 Distill Llama 8B	DeepSeek	8B	BF16	18.5 GB	40	Fits
Qwen3 8B	Alibaba	8B	BF16	18.6 GB	40	Fits
Qwen 2.5 7B Instruct	Alibaba	7.6B	BF16	17.3 GB	42	Fits
Mistral 7B Instruct v0.3	Mistral AI	7.25B	BF16	16.8 GB	44	Fits
Gemma 3 4B Instruct	Google	4B	FP32	18.2 GB	40	Fits
Gemma 4 E4B	Google	4B	FP32	18.5 GB	40	Fits
Phi-3.5 Mini Instruct	Microsoft	3.8B	FP32	18.8 GB	39	Fits
Llama 3.2 3B Instruct	Meta	3.2B	FP32	14.9 GB	49	Fits
Qwen 2.5 3B Instruct	Alibaba	3.1B	FP32	14.1 GB	52	Fits
Gemma 2 2B Instruct	Google	2.6B	FP32	12.1 GB	61	Fits
Gemma 4 E2B	Google	2B	FP32	9.2 GB	80	Fits
SmolLM2 1.7B Instruct	Hugging Face	1.7B	FP32	8.5 GB	86	Fits
Qwen 2.5 1.5B Instruct	Alibaba	1.5B	FP32	6.8 GB	107	Fits
Llama 3.2 1B Instruct	Meta	1.24B	FP32	5.7 GB	129	Fits
Gemma 3 1B Instruct	Google	1B	FP32	4.7 GB	157	Fits
Qwen 2.5 0.5B Instruct	Alibaba	0.5B	FP32	2.3 GB	320	Fits
SmolLM2 360M Instruct	Hugging Face	0.36B	FP32	1.8 GB	408	Fits

Showing 80 models sorted by parameter count. VRAM estimates include model weights (at best-fitting quantization), KV cache for 4,096 tokens, and ~12% activation overhead. Actual performance varies with inference framework and system configuration.
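The three-part estimate described here can be sketched in a few lines. This is a rough approximation, not CanItRun's actual calculator: the function names are hypothetical, the 12% overhead is assumed to apply to weights plus KV cache together, and the Llama 3.1 8B shape used in the example (32 layers, 8 KV heads, head dimension 128) is an assumption brought in for illustration.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in GB: one key and one value vector per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value / 1e9


def estimate_vram_gb(params_billion, bits_per_weight, kv_gb, overhead=0.12):
    """Weights plus KV cache, scaled up by the assumed activation overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # billions of params * bytes each
    return (weights_gb + kv_gb) * (1 + overhead)


# Assumed Llama 3.1 8B shape, BF16 weights (16 bits), 4,096 tokens of context.
kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=4096)
total = estimate_vram_gb(params_billion=8, bits_per_weight=16, kv_gb=kv)
print(f"KV cache: {kv:.2f} GB, total: {total:.1f} GB")
```

Under these assumptions the total comes to about 18.5 GB, in line with the Llama 3.1 8B Instruct row above. Lower-bit formats shrink the weights term in proportion to `bits_per_weight`, which is why the larger models in the table only fit at NVFP4 or Q2_K.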

Featured Guides

Practical guides for running open-weight LLMs on your hardware — from choosing the right quantization format to finding the best model for your VRAM budget.

Tutorial · 10 min read

GGUF Quantization Explained: Q4, Q5, Q6, Q8 Compared

Quantization is the single most important technique for running large models on consumer hardware. Here is how each GGUF quantization level actually works and when to use it.

June 28, 2026

VRAM Guide · 8 min read

Best LLMs You Can Run on 8 GB VRAM (2026)

Eight gigabytes of VRAM is tight but far from useless. Here are the best models that actually fit, with honest trade-offs and practical setup advice.

June 28, 2026

Hardware · 11 min read

Best GPU for Running LLMs Locally in 2026

Buying a GPU for local LLMs? VRAM matters more than compute. Here is a practical buyer's guide covering every budget tier from $200 to $2000 and beyond.

June 28, 2026

Tutorial · 9 min read

Getting Started with Ollama: Run Any LLM in One Command

Ollama makes running LLMs locally as simple as a single terminal command. Here is everything you need from installation to advanced customization.

June 28, 2026

VRAM Guide · 9 min read

Best LLMs for 16 GB VRAM GPUs (2026)

Sixteen gigabytes of VRAM is the sweet spot for local LLM inference in 2026. Here are the models that make the most of it across coding, reasoning, and general use.

July 20, 2026

VRAM Guide · 9 min read

How Much VRAM Does Llama 3 Need? Complete Guide

Find out exactly how much VRAM you need to run Llama 3 models locally, from the 8B variant on a budget GPU to the full 405B on multi-GPU setups.

June 28, 2026

Browse all guides →

Trending Models

The most popular open-weight models right now — see how they compare and which GPUs can run them.

Qwen 3.6 27B · 27B

Alibaba · Q4_K_M recommended

DeepSeek R1 Distill Qwen 32B · 32.5B

DeepSeek · Q4_K_M recommended

Llama 4 Scout 109B · 109B (17B active)

Meta · Q4_K_M recommended

Qwen 3.5 35B-A3B (MoE) · 35B (3B active)

Alibaba · Q4_K_M recommended

Gemma 4 31B · 31B

Google · Q4_K_M recommended

Llama 3.3 70B Instruct · 70B

Meta · Q4_K_M recommended

Browse all models →

Popular GPUs for Local LLMs

These GPUs are the most popular choices for local LLM inference — from consumer cards to Apple Silicon.

NVIDIA RTX 3090 · 24 GB

NVIDIA · Consumer · 936 GB/s

NVIDIA RTX 4090 · 24 GB

NVIDIA · Consumer · 1008 GB/s

NVIDIA RTX 5090 · 32 GB

NVIDIA · Consumer · 1792 GB/s

Apple M4 Max (128GB) · 128 GB

Apple · Laptop · 546 GB/s
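Memory bandwidth matters because token generation is usually limited by how fast the model's weights can be read from memory. A common rule of thumb, sketched below, divides bandwidth by the model's memory footprint and applies an efficiency factor. The function name and the 0.73 factor are assumptions: the factor is back-fitted from the dense, fully fitting rows of the RTX 4090 table on this page and is not a documented CanItRun constant. The 19.1 GB footprint is the Qwen3 32B NVFP4 figure from that table.

```python
# Bandwidth figures (GB/s) as listed above.
GPU_BANDWIDTH_GBPS = {
    "NVIDIA RTX 3090": 936,
    "NVIDIA RTX 4090": 1008,
    "NVIDIA RTX 5090": 1792,
    "Apple M4 Max (128GB)": 546,
}


def est_tokens_per_sec(bandwidth_gbps, model_gb, efficiency=0.73):
    """Rough decode speed if every token requires one full pass over model memory."""
    return bandwidth_gbps * efficiency / model_gb


MODEL_GB = 19.1  # Qwen3 32B at NVFP4, per the compatibility table

for gpu, bandwidth in GPU_BANDWIDTH_GBPS.items():
    print(f"{gpu}: ~{est_tokens_per_sec(bandwidth, MODEL_GB):.0f} tok/s")
```

The estimate only applies when the whole model sits in GPU memory. Offloaded layers run at system RAM speeds, and MoE models read only their active parameters per token, so both deviate sharply from this simple ratio.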

Browse all GPUs →

Run AI apps with the right model

Not every AI app needs the same hardware. Coding agents, chat frontends, roleplay tools, and self-hosted apps can use local models, OpenRouter models, or both. Find which apps work with your setup — and which models make them useful.

Coding Agents (8 apps)

Cline, Roo Code, Aider, Continue, Claude Code — find which works with your hardware.

Chat Frontends (7 apps)

Open WebUI, LibreChat, SillyTavern — self-hosted or cloud, find your setup.

Local LLM Tools (6 apps)

Ollama, LM Studio, llama.cpp, vLLM — pick the right engine for your GPU.

Self-Hosted Apps (3 apps)

Open WebUI, LibreChat, SillyTavern — run AI apps on your own hardware with full privacy.

Browse all apps & agents →

Why This Tool Exists

Running large language models locally gives you privacy, control, and no per-token API fees, but working out which models fit on your GPU is a manual, error-prone process. CanItRun removes the guesswork. We built this as a free, open tool for developers, researchers, and hobbyists who want to experiment with open-weight LLMs without cloud dependencies.

How it works: Model VRAM requirements are calculated from three components: base model weights (adjusted for quantization level), key-value cache for your target context length, and activation memory for inference. We then compare these requirements against a comprehensive database of real GPU specifications and community-reported benchmarks to tell you not just whether a model fits, but how it's likely to perform.
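The comparison step can be pictured as a three-way check of the memory requirement against GPU memory and system RAM. The sketch below is a simplified guess at that logic: the function name and the exact thresholds are assumptions, and a real calculator would likely reserve headroom for the operating system and other processes.

```python
def verdict(required_gb, vram_gb, system_ram_gb):
    """Classify a model as fitting in VRAM, needing CPU offload, or not running."""
    if required_gb <= vram_gb:
        return "Fits"      # everything stays on the GPU
    if required_gb <= vram_gb + system_ram_gb:
        return "Offload"   # the remainder spills into system RAM
    return "No"            # more memory than the machine has in total


# A 24 GB GPU with 32 GB of system RAM, using requirement figures in GB.
for required in (19.1, 40.7, 400.0):
    print(f"{required:>6.1f} GB -> {verdict(required, vram_gb=24, system_ram_gb=32)}")
```

With a 24 GB card and 32 GB of RAM, a 19.1 GB requirement fits, 40.7 GB needs offload, and 400 GB cannot run. Offloaded models are much slower because part of every forward pass waits on system RAM.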

Who should use this: ML engineers prototyping locally, researchers on academic budgets, students learning about LLMs, and hobbyists running models on consumer hardware. If you're evaluating whether to upgrade your GPU or trying to squeeze the largest model onto your existing setup, this tool helps you make data-driven decisions.

What sets CanItRun apart: Unlike generic calculators, we maintain an extensive GPU database with real-world benchmarks, provide quantization-specific recommendations (from FP16 down to INT4), and show expected tokens-per-second performance based on community data. We also track emerging architectures and new GPU releases to keep recommendations current.

Common Use Cases

Local Development & Prototyping

Test prompts, fine-tune adapters, and iterate on RAG pipelines without incurring API costs or sending proprietary data to the cloud.

Privacy-Sensitive Applications

Process legal documents, medical records, or internal communications with zero data leaving your infrastructure.

Education & Research

Students and researchers can experiment with state-of-the-art models on academic budgets using consumer or lab hardware.

Edge Deployment

Evaluate which models fit on target deployment hardware — from Jetson Orin to Mac mini to gaming laptops with limited VRAM.