CanItRun

Methodology

How CanItRun decides whether a model fits on your hardware, and where the data comes from.

Memory calculation

For a given model and quantization level:

weights_GB  = params_B × bytes_per_param[quant]
kv_cache_GB = 2 × layers × kv_heads × head_dim × context × 2 bytes
overhead    = (weights + kv) × 0.12
total_GB    = weights + kv_cache + overhead

Bytes per parameter: FP16 = 2.0, Q8 = 1.0, Q6 = 0.75, Q5 = 0.625, Q4 = 0.5, Q3 = 0.4, Q2 = 0.3. These match typical GGUF file sizes. The KV cache uses FP16 (2 bytes) and scales linearly with context length — for long-context inference this can exceed the weight size.
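As a sketch, the calculation above in TypeScript (the function and type names here are illustrative, not the actual code in data/models.ts):

```typescript
// Bytes per parameter for each quantization level (the table above).
const BYTES_PER_PARAM: Record<string, number> = {
  FP16: 2.0, Q8: 1.0, Q6: 0.75, Q5: 0.625, Q4: 0.5, Q3: 0.4, Q2: 0.3,
};

interface ModelSpec {
  paramsB: number; // total parameters, in billions
  layers: number;
  kvHeads: number;
  headDim: number;
}

// Estimated memory footprint in GB at a given quant and context length.
function estimateMemoryGB(m: ModelSpec, quant: string, context: number): number {
  const weightsGB = m.paramsB * BYTES_PER_PARAM[quant];
  // KV cache: 2 (K and V) x layers x kv_heads x head_dim x context x 2 bytes (FP16),
  // converted from bytes to GB.
  const kvCacheGB = (2 * m.layers * m.kvHeads * m.headDim * context * 2) / 1e9;
  const overheadGB = (weightsGB + kvCacheGB) * 0.12;
  return weightsGB + kvCacheGB + overheadGB;
}
```

For example, an 8B model with 32 layers, 8 KV heads, and head dim 128 (a Llama-3-8B-shaped config) at Q4 with an 8K context comes out to roughly 5.7 GB: 4 GB of weights, about 1.1 GB of KV cache, plus 12% overhead.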

Verdicts

  • Fits — total memory ≤ available VRAM. Runs at full GPU speed.
  • Offload — weights spill into system RAM via CPU offload. Usable but much slower (we apply a 4× penalty to the tok/s estimate).
  • Won't run — even with RAM offload, the model won't fit.
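The three verdicts reduce to two threshold checks. A minimal sketch, with hypothetical names (not the site's actual code):

```typescript
type Verdict = "fits" | "offload" | "wont-run";

// Decide the verdict from required memory vs. available VRAM and system RAM.
function verdict(totalGB: number, vramGB: number, ramGB: number): Verdict {
  if (totalGB <= vramGB) return "fits"; // full GPU speed
  if (totalGB <= vramGB + ramGB) return "offload"; // CPU offload, ~4x slower
  return "wont-run"; // exceeds VRAM + RAM combined
}
```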

Tokens per second estimate

Local LLM inference is memory-bandwidth-bound, not FLOPS-bound. Our rough estimate:

tok/s ≈ memory_bandwidth_GBs / active_weights_GB

For MoE models, we use active parameters (only the experts activated for each token count toward bandwidth). Real-world throughput varies ±30% with quantization kernels, batch size, and runtime (llama.cpp, vLLM, MLX). Treat these as ballpark figures, not promises.
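The estimate above as a sketch, including the 4× offload penalty from the verdicts section (names are illustrative):

```typescript
// Rough tokens/sec for a memory-bandwidth-bound decode: divide bandwidth
// by the bytes streamed per token (active weights at the chosen quant).
function estimateTokS(
  activeParamsB: number, // active params (equals total params for dense models)
  bytesPerParam: number, // e.g. 0.5 for Q4
  bandwidthGBs: number, // GPU memory bandwidth in GB/s
  offloaded: boolean, // true when weights spill into system RAM
): number {
  const activeWeightsGB = activeParamsB * bytesPerParam;
  const tokS = bandwidthGBs / activeWeightsGB;
  return offloaded ? tokS / 4 : tokS; // 4x penalty for CPU offload
}
```

For instance, an 8B dense model at Q4 (4 GB of active weights) on a GPU with ~1000 GB/s of bandwidth lands around 250 tok/s as an upper bound; offloaded, around 60.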

Data sources

Limitations

  • Inference only — training and fine-tuning need roughly 4–6× more memory.
  • Batch size 1, single-user chat. Concurrent users need proportionally more KV cache.
  • No speculative decoding or flash-attention discounts applied.
  • Apple Silicon unified memory: we assume ~75% of total RAM is available to GPU.
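The unified-memory assumption in the last bullet is a single multiplier. A minimal sketch (the constant name is illustrative):

```typescript
// Apple Silicon unified memory: assume ~75% of total RAM is GPU-usable.
const APPLE_GPU_FRACTION = 0.75;

function usableUnifiedGB(totalRamGB: number): number {
  return totalRamGB * APPLE_GPU_FRACTION;
}
```

So a 32 GB machine is treated as having 24 GB available to the GPU for the fits/offload check.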

Corrections

Found wrong numbers? Data updates are welcome via pull request. Model and GPU data live in data/models.ts and data/gpus.ts.