About CanItRun
How CanItRun decides whether a model fits on your hardware, and where the data comes from.
What is CanItRun?
CanItRun is a free, open calculator for developers running large language models locally. You pick your GPU — from a laptop-class RTX 4060 to a datacenter H100 or an Apple M3 Max — and the tool instantly shows which open-weight models will fit in VRAM, at what quantization level, and roughly how many tokens per second to expect. Benchmark scores (MMLU-Pro, GPQA, Arena ELO) are shown alongside each model so you can weigh performance tradeoffs at a glance.
The calculations are based on published architecture specs from Hugging Face model cards, observed GGUF file sizes from the Ollama library, and GPU specs from manufacturer datasheets. No proprietary data, no affiliate links — just the numbers.
Memory calculation
For a given model and quantization level:
weights_GB = params_B × bytes_per_param[quant]
kv_cache_GB = 2 × layers × kv_heads × head_dim × context × 2 bytes
overhead = (weights + kv) × 0.12
total_GB = weights + kv_cache + overhead
Bytes per parameter: FP16 = 2.0, Q8 = 1.0, Q6 = 0.75, Q5 = 0.625, Q4 = 0.5, Q3 = 0.4, Q2 = 0.3. These match typical GGUF file sizes. The KV cache uses FP16 (2 bytes) and scales linearly with context length — for long-context inference this can exceed the weight size.
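The formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the calculator's actual source: the function name, argument names, and the example model shape are hypothetical, and the conversion of the KV cache from bytes to GB assumes 1 GB = 10^9 bytes, which the formulas do not state.

```python
# Bytes per parameter for each quantization level, as listed above.
BYTES_PER_PARAM = {
    "FP16": 2.0, "Q8": 1.0, "Q6": 0.75, "Q5": 0.625,
    "Q4": 0.5, "Q3": 0.4, "Q2": 0.3,
}
OVERHEAD_FRACTION = 0.12   # overhead = (weights + kv) x 0.12
KV_BYTES_PER_VALUE = 2     # KV cache stored in FP16


def estimate_memory_gb(params_b, quant, layers, kv_heads, head_dim, context):
    """Return a breakdown of estimated memory use in GB (hypothetical helper)."""
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    # The leading 2 covers keys and values; the result here is in bytes.
    kv_bytes = 2 * layers * kv_heads * head_dim * context * KV_BYTES_PER_VALUE
    kv_cache_gb = kv_bytes / 1e9  # assumption: 1 GB = 10^9 bytes
    overhead_gb = (weights_gb + kv_cache_gb) * OVERHEAD_FRACTION
    return {
        "weights_gb": weights_gb,
        "kv_cache_gb": kv_cache_gb,
        "overhead_gb": overhead_gb,
        "total_gb": weights_gb + kv_cache_gb + overhead_gb,
    }


# Made-up model shape: 8B parameters, 32 layers, 8 KV heads, head_dim 128.
example = estimate_memory_gb(8, "Q4", layers=32, kv_heads=8,
                             head_dim=128, context=8192)
```

For this made-up shape at Q4 with an 8,192-token context, the weights come to 4.0 GB, the KV cache to roughly 1.07 GB, and the total to about 5.68 GB.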
Verdicts
- Fits — total memory ≤ available VRAM. Runs at full GPU speed.
- Offload — weights spill into system RAM via CPU offload. Usable but much slower (we apply a 4× penalty to the tok/s estimate).
- Won't run — even with RAM offload, the model won't fit.
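A minimal sketch of how the three verdicts might be assigned. The function name is hypothetical, and the offload threshold is an assumption: the text does not say how much system RAM counts as usable, so this sketch simply adds VRAM and system RAM together.

```python
def verdict(total_gb, vram_gb, system_ram_gb):
    """Classify a model against available memory (hypothetical helper)."""
    if total_gb <= vram_gb:
        return "Fits"
    # Assumption: offload is possible whenever the model fits in
    # VRAM plus system RAM combined.
    if total_gb <= vram_gb + system_ram_gb:
        return "Offload"
    return "Won't run"


# Example with an 8 GB GPU and 16 GB of system RAM.
small = verdict(5.7, vram_gb=8, system_ram_gb=16)    # within VRAM
medium = verdict(20.0, vram_gb=8, system_ram_gb=16)  # needs system RAM
large = verdict(40.0, vram_gb=8, system_ram_gb=16)   # exceeds both
```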
Tokens per second estimate
Local LLM inference at batch size 1 is memory-bandwidth-bound, not FLOPS-bound: generating each token requires reading every active weight from memory once. Our rough estimate:
tok/s ≈ memory_bandwidth_GBs / active_weights_GB
For MoE models, we use active parameters (only experts activated per token count toward bandwidth). Real-world throughput varies ±30% with quantization kernels, batch size, and runtime (llama.cpp, vLLM, MLX). Treat these as ballpark, not promises.
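The estimate can be written as a short function. This is a sketch under stated assumptions: the function and argument names are hypothetical, the 4× offload penalty is taken from the Verdicts section, and the 1,000 GB/s bandwidth in the example is a round illustrative number rather than the spec of any particular GPU.

```python
OFFLOAD_PENALTY = 4.0  # penalty applied to the estimate for the Offload verdict


def estimate_tok_per_s(bandwidth_gb_s, active_params_b, bytes_per_param,
                       offloaded=False):
    """Ballpark tokens per second from memory bandwidth (hypothetical helper).

    For MoE models, pass the active parameter count, not the total.
    """
    active_weights_gb = active_params_b * bytes_per_param
    tok_s = bandwidth_gb_s / active_weights_gb
    if offloaded:
        tok_s /= OFFLOAD_PENALTY
    return tok_s


# Dense 8B model at Q4 (0.5 bytes per parameter) on a 1,000 GB/s device.
on_gpu = estimate_tok_per_s(1000, 8, 0.5)
with_offload = estimate_tok_per_s(1000, 8, 0.5, offloaded=True)
```

With these inputs the active weights are 4 GB, so the estimate is 250 tok/s on the GPU and 62.5 tok/s with offload. Both figures carry the ±30% variation noted above.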
Data sources
- Model metadata (params, context, license, architecture): Hugging Face model cards.
- Quantized file sizes: Ollama library GGUF builds.
- Benchmarks (MMLU-Pro, GPQA, IFEval, MATH, BBH, MuSR): Open LLM Leaderboard v2.
- Arena ELO: LMSYS Chatbot Arena.
- GPU specs (VRAM, memory bandwidth): manufacturer datasheets and TechPowerUp.
Feedback & contact
Found a bug, have a correction, or want to suggest a GPU or model we're missing? Email us at [email protected].
Limitations
- Inference only — training and fine-tuning need roughly 4–6× more memory.
- Batch size 1, single-user chat. Concurrent users need proportionally more KV cache.
- No speculative decoding or flash-attention discounts applied.
- Apple Silicon unified memory: we assume ~75% of total RAM is available to GPU.
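Two of the limitations above translate directly into arithmetic, sketched here with hypothetical helper names. The sketch assumes every concurrent user holds a full-length context, which is the worst case for KV cache size.

```python
APPLE_GPU_FRACTION = 0.75  # share of unified memory assumed available to the GPU


def apple_usable_memory_gb(total_ram_gb):
    """Memory treated as available to the GPU on Apple Silicon."""
    return total_ram_gb * APPLE_GPU_FRACTION


def kv_cache_for_users_gb(kv_cache_gb_single, users):
    """KV cache scaled proportionally to the number of concurrent users.

    Assumption: each user keeps a full-length context in memory.
    """
    return kv_cache_gb_single * users


usable = apple_usable_memory_gb(64)          # 64 GB machine
kv_four = kv_cache_for_users_gb(1.0, 4)      # 1 GB per user, 4 users
```

A 64 GB machine is treated as having 48 GB for the model, and a 1 GB single-user KV cache grows to 4 GB for four concurrent users.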