
llama.cpp

The engine underneath Ollama — but faster. Full control over quants, context, and grammars. Grammar file support enables GPT-OSS tool calling in Cline.

App type: Local LLM tool, developer tool

Local models: Yes

OpenRouter: No

Ollama: No

GPU required: Only for local models

Best for: Maximum local inference performance (10-20% faster than Ollama)

Setup difficulty: Medium

Platforms: macOS, Linux, Windows, CLI

Pricing: Open source, free

llama.cpp is the engine underneath Ollama, but faster: it is the open-source inference engine that powers Ollama, LM Studio, and most local LLM tools. It gives you full control over quantization, context, and grammars, and its grammar file support enables GPT-OSS tool calling in Cline.

llama.cpp runs entirely on your local hardware and is open source (https://github.com/ggerganov/llama.cpp), so you can inspect the code and self-host. VRAM requirements are the same as Ollama for equivalent models, but inference is typically 10-20% faster. Grammar file support lets you run GPT-OSS-20B with Cline tool calling on 16 GB of VRAM (MXFP4 quant).
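
As a rough illustration, here is one way to launch llama-server with a grammar file from Python. The model path, grammar path, port, and the exact flags (-m, -c, -ngl, --grammar-file) are assumptions, not guarantees; check them against llama-server --help for your build.

    import subprocess

    # Assumed local paths; substitute your own GGUF model and GBNF grammar.
    MODEL = "models/gpt-oss-20b-mxfp4.gguf"      # hypothetical model path
    GRAMMAR = "grammars/tool-calling.gbnf"       # hypothetical grammar file

    server = subprocess.Popen([
        "llama-server",
        "-m", MODEL,                 # model to load
        "-c", "16384",               # context window in tokens
        "-ngl", "99",                # offload all layers to the GPU
        "--port", "8080",            # OpenAI-compatible endpoint at http://127.0.0.1:8080/v1
        "--grammar-file", GRAMMAR,   # constrain output so Cline tool calls parse cleanly
    ])
    server.wait()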

Can it run on my hardware?

Minimum

16 GB VRAM for GPT-OSS-20B at the MXFP4 quant, with grammar file support enabling Cline tool calling. VRAM requirements are the same as Ollama for equivalent models, but inference is 10-20% faster.

Recommended

24 GB VRAM (RTX 3090/4090) for Qwen 2.5 Coder 32B at Q4 with 64K context. For GPU-poor setups, CPU-only with 32 GB+ system RAM works for 7-14B models at acceptable speeds. Use llama-bench to find optimal thread/batch settings for your hardware.
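
A minimal sketch of that tuning loop, assuming llama-bench is on your PATH; the model path is a placeholder, and the flag names (-m, -t, -b) should be confirmed with llama-bench --help.

    import subprocess

    MODEL = "models/qwen2.5-coder-32b-q4_k_m.gguf"  # hypothetical GGUF path

    # Try a few thread counts and batch sizes; llama-bench prints tokens/s for each run.
    for threads in (4, 8, 16):
        for batch in (256, 512):
            subprocess.run(
                ["llama-bench", "-m", MODEL, "-t", str(threads), "-b", str(batch)],
                check=True,
            )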

Approximate VRAM needed for recommended local models at Q4 with 8K context:

Model                         Params   Q4 VRAM    Min GPU
Qwen3 32B                     32.8B    ~22.2 GB   24 GB
Qwen3 30B-A3B (MoE)           30B      ~19.8 GB   24 GB
Qwen 2.5 Coder 32B Instruct   32.5B    ~22.9 GB   24 GB
GPT-OSS 20B                   21B      ~13.7 GB   16 GB
Llama 3.1 70B Instruct        70B      ~47.1 GB   48 GB+

Check your GPU against these models in the calculator →
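
For a back-of-envelope version of what the calculator does, the figures above roughly decompose into quantized weights plus KV cache. The constants below (about 0.6 bytes per parameter for Q4-class quants, fp16 KV cache, a fixed overhead) are rough assumptions, so expect the result to approximate rather than reproduce the table.

    def estimate_vram_gb(params_b: float, n_layers: int, n_kv_heads: int,
                         head_dim: int, ctx: int,
                         bytes_per_param: float = 0.6) -> float:
        """Rough VRAM estimate: Q4-ish weights + fp16 KV cache + fixed overhead."""
        weights = params_b * 1e9 * bytes_per_param
        kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx * 2  # K and V, 2 bytes each
        overhead = 0.5e9  # compute buffers and scratch space (assumed)
        return (weights + kv_cache + overhead) / 1e9

    # Example: a 32B dense model with GQA (layer/head counts are illustrative, not official specs).
    # Prints ~22.3, close to the ~22.2 GB figure in the table above.
    print(round(estimate_vram_gb(32.8, n_layers=64, n_kv_heads=8,
                                 head_dim=128, ctx=8192), 1))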

App compatibility

Feature                 Supported
Local models            Yes
OpenRouter              No
OpenAI-compatible API   Yes
Ollama                  No
LM Studio               No
Anthropic API           No
Google API              No
Mistral API             No
Docker                  No
Works offline           Yes
Needs GPU               No

Recommended models

Best local models

Local vs cloud: which should you use?

Use local models if

  • You want privacy — data never leaves your machine
  • You already have a GPU with sufficient VRAM
  • You want zero per-token API costs
  • You need offline access

Use cloud/API if

  • Your GPU has insufficient VRAM for the models you need
  • You want access to frontier model quality
  • You need maximum coding/reasoning performance
  • You don't want to manage local model downloads and updates

Setup overview

Setting up llama.cpp is moderate in complexity. It runs on macOS, Linux, and Windows as a command-line tool.
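
Once llama-server is running, clients such as Cline connect through its OpenAI-compatible API. A minimal sketch using the openai Python package, assuming the server is listening locally on port 8080; the API key is a placeholder, and the model name is largely ignored because llama-server serves whatever model it was launched with.

    from openai import OpenAI

    # llama-server exposes an OpenAI-compatible API under /v1 (port 8080 assumed here).
    client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="llama-cpp")

    resp = client.chat.completions.create(
        model="local",  # placeholder; the loaded model is chosen at server launch
        messages=[{"role": "user", "content": "Write a haiku about GGUF files."}],
    )
    print(resp.choices[0].message.content)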

Limitations

  • Not beginner-friendly (use Ollama or LM Studio instead)
  • No built-in model management, so quick model swapping needs a helper such as llama-swap
  • No GUI; it is a CLI tool, so pair it with a frontend

Related

Recommended GPUs

Compatible models

Related apps

Frequently asked questions

What is llama.cpp?
llama.cpp is the open-source inference engine that powers Ollama, LM Studio, and most local LLM tools. It is the engine underneath Ollama, but faster, with full control over quantization, context, and grammars; grammar file support enables GPT-OSS tool calling in Cline.
Does llama.cpp need a GPU?
llama.cpp itself does not require a GPU, but the models you run with it benefit greatly from one. VRAM requirements are the same as Ollama for equivalent models, with 10-20% faster inference. Grammar file support lets you run GPT-OSS-20B with Cline tool calling on 16 GB of VRAM (MXFP4 quant).
Can I run llama.cpp on CPU only?
Yes — llama.cpp supports CPU-only operation, but performance will be significantly slower (5-10x) compared to GPU inference. CPU-only works best for models under 7B parameters with at least 16 GB of system RAM.
What models work best with llama.cpp?
Models that work well with llama.cpp include: Qwen3 32B, Qwen3 30B-A3B (MoE), Qwen 2.5 Coder 32B Instruct, GPT-OSS 20B, Llama 3.1 70B Instruct. The best model depends on your GPU's VRAM and your use case.
Is llama.cpp free and open source?
Yes. llama.cpp is open source and completely free. You can find the source code on GitHub at https://github.com/ggerganov/llama.cpp.