Methodology
How CanItRun decides whether a model fits on your hardware, and where the data comes from.
Memory calculation
For a given model and quantization level:
weights_GB  = params_B × bytes_per_param[quant]
kv_cache_GB = 2 × layers × kv_heads × head_dim × context × 2 bytes
overhead_GB = (weights_GB + kv_cache_GB) × 0.12
total_GB    = weights_GB + kv_cache_GB + overhead_GB
Bytes per parameter: FP16 = 2.0, Q8 = 1.0, Q6 = 0.75, Q5 = 0.625, Q4 = 0.5, Q3 = 0.4, Q2 = 0.3. These match typical GGUF file sizes. The KV cache uses FP16 (2 bytes) and scales linearly with context length — for long-context inference this can exceed the weight size.
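The formulas above can be sketched in code. This is an illustrative TypeScript version, not the actual CanItRun source; the type and function names are assumptions.

```typescript
// Bytes per parameter at each quantization level (from the table above).
const BYTES_PER_PARAM: Record<string, number> = {
  FP16: 2.0, Q8: 1.0, Q6: 0.75, Q5: 0.625, Q4: 0.5, Q3: 0.4, Q2: 0.3,
};

interface ModelSpec {
  paramsB: number; // total parameters, in billions
  layers: number;  // transformer layers
  kvHeads: number; // KV heads (for GQA this is fewer than attention heads)
  headDim: number; // dimension per head
}

function totalMemoryGB(m: ModelSpec, quant: string, context: number): number {
  const weights = m.paramsB * BYTES_PER_PARAM[quant];
  // K and V caches, FP16 (2 bytes per element), converted bytes -> GB
  const kvCache = (2 * m.layers * m.kvHeads * m.headDim * context * 2) / 1e9;
  const overhead = (weights + kvCache) * 0.12; // 12% runtime overhead
  return weights + kvCache + overhead;
}
```

For example, a hypothetical 8B GQA model (32 layers, 8 KV heads, head dim 128) at Q4 with an 8K context comes out to roughly 5.7 GB.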
Verdicts
- Fits — total memory ≤ available VRAM. Runs at full GPU speed.
- Offload — weights spill into system RAM via CPU offload. Usable but much slower (we apply a 4× penalty to the tok/s estimate).
- Won't run — even with RAM offload, the model won't fit.
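The three verdicts reduce to a simple threshold check. A minimal sketch, assuming the cutoff for offload is VRAM plus system RAM (names are illustrative):

```typescript
type Verdict = "fits" | "offload" | "wont_run";

function verdict(totalGB: number, vramGB: number, ramGB: number): Verdict {
  if (totalGB <= vramGB) return "fits";            // full GPU speed
  if (totalGB <= vramGB + ramGB) return "offload"; // CPU offload, ~4x slower
  return "wont_run";                               // exceeds VRAM + RAM
}
```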
Tokens per second estimate
Local LLM inference is memory-bandwidth-bound, not FLOPS-bound. Our rough estimate:
tok/s ≈ memory_bandwidth_GBs / active_weights_GB
For MoE models, we use active parameters (only experts activated per token count toward bandwidth). Real-world throughput varies ±30% with quantization kernels, batch size, and runtime (llama.cpp, vLLM, MLX). Treat these as ballpark, not promises.
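Putting the bandwidth estimate and the offload penalty together, a rough sketch (illustrative names; the 4× penalty is the one from the Verdicts section):

```typescript
// Bandwidth-bound throughput estimate. Real-world results vary ±30%.
function tokensPerSecond(
  bandwidthGBs: number,  // GPU memory bandwidth in GB/s
  activeParamsB: number, // active params per token, billions (MoE: experts only)
  bytesPerParam: number, // from the quantization table
  offloaded = false,     // apply the 4x CPU-offload penalty
): number {
  const activeWeightsGB = activeParamsB * bytesPerParam;
  const tps = bandwidthGBs / activeWeightsGB;
  return offloaded ? tps / 4 : tps;
}
```

For instance, an 8B dense model at Q4 (4 GB of active weights) on a 1000 GB/s GPU estimates to about 250 tok/s, or about 62 tok/s when offloaded.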
Data sources
- Model metadata (params, context, license, architecture): Hugging Face model cards.
- Quantized file sizes: Ollama library GGUF builds.
- Benchmarks (MMLU-Pro, GPQA, IFEval, MATH, BBH, MuSR): Open LLM Leaderboard v2.
- Arena ELO: LMSYS Chatbot Arena.
- GPU specs (VRAM, memory bandwidth): manufacturer datasheets and TechPowerUp.
Limitations
- Inference only — training and fine-tuning need roughly 4–6× more memory.
- Batch size 1, single-user chat. Concurrent users need proportionally more KV cache.
- No speculative decoding or flash-attention discounts applied.
- Apple Silicon unified memory: we assume ~75% of total RAM is available to GPU.
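The unified-memory assumption in the last bullet can be expressed directly (a sketch; the 75% figure is the assumption stated above):

```typescript
// GPU-usable memory: discrete GPUs expose all their VRAM, while
// Apple Silicon shares RAM with the CPU, so we assume ~75% is usable.
function usableVramGB(totalGB: number, isAppleSilicon: boolean): number {
  return isAppleSilicon ? totalGB * 0.75 : totalGB;
}
```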
Corrections
Found wrong numbers? Data updates are welcome via pull request. Model and GPU data live in data/models.ts and data/gpus.ts.
