llama.cpp
The engine underneath Ollama — but faster. Full control over quants, context, and grammars. Grammar file support enables GPT-OSS tool calling in Cline.
Local LLM Tool, Developer Tool
Yes
No
No
Only for local models
Maximum local inference performance (10-20% faster than Ollama)
Medium
macOS, Linux, Windows, CLI
Open source — free
llama.cpp is the engine underneath Ollama, but faster, with full control over quants, context, and grammars. Grammar file support enables GPT-OSS tool calling in Cline. It is the open-source inference engine that powers Ollama, LM Studio, and most local LLM tools.
llama.cpp runs entirely on your local hardware and is open source (https://github.com/ggerganov/llama.cpp), so you can inspect the code and self-host. VRAM requirements match Ollama for equivalent models, with roughly 10-20% faster inference. Grammar file support lets you run GPT-OSS-20B with Cline tool calling at 16 GB VRAM (MXFP4 quant).
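As a rough illustration of how grammar constraints are applied, the sketch below posts a completion request to a running llama-server instance with a toy GBNF grammar that restricts output to a yes/no answer. It assumes llama-server is already listening on its default port 8080 and that the `requests` package is installed; real tool-calling grammars for GPT-OSS and Cline are far larger, but the request shape is the same.

```python
# Minimal sketch: grammar-constrained completion against a local llama-server.
# Assumes llama-server is already running on localhost:8080 with a model loaded.
import requests

# Toy GBNF grammar that only allows "yes" or "no" as the generated text.
# Real tool-calling grammars (e.g. for GPT-OSS in Cline) are much larger.
GRAMMAR = r'''
root ::= "yes" | "no"
'''

response = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Is llama.cpp written mostly in C++? Answer yes or no: ",
        "grammar": GRAMMAR,   # GBNF grammar string; constrains token sampling
        "n_predict": 4,       # yes/no only needs a few tokens
        "temperature": 0,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["content"])
```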
Can it run on my hardware?
Minimum
16 GB VRAM runs GPT-OSS-20B (MXFP4 quant) with Cline tool calling via grammar files. VRAM requirements match Ollama for equivalent models, with roughly 10-20% faster inference.
Recommended
24 GB VRAM (RTX 3090/4090) for Qwen 2.5 Coder 32B at Q4 with 64K context. For GPU-poor setups, CPU-only with 32 GB+ system RAM works for 7-14B models at acceptable speeds. Use llama-bench to find optimal thread/batch settings for your hardware.
Approximate VRAM needed for recommended local models at Q4 with 8K context:
| Model | Params | Q4 VRAM | Min GPU |
|---|---|---|---|
| Qwen3 32B | 32.8B | ~22.2 GB | 24 GB |
| Qwen3 30B-A3B (MoE) | 30B | ~19.8 GB | 24 GB |
| Qwen 2.5 Coder 32B Instruct | 32.5B | ~22.9 GB | 24 GB |
| GPT-OSS 20B | 21B | ~13.7 GB | 16 GB |
| Llama 3.1 70B Instruct | 70B | ~47.1 GB | 48 GB+ |
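The figures above follow a rule of thumb rather than an exact formula. The sketch below uses ballpark constants that are assumptions, not values taken from llama.cpp: roughly 0.6 GB per billion parameters for Q4 weights plus an allowance for the KV cache and compute buffers at 8K context. It will not reproduce the table exactly, since quant flavor and architecture both matter, but it is a reasonable first-pass check against your GPU.

```python
# Rough Q4 VRAM estimate at ~8K context. Constants are ballpark assumptions,
# not values taken from llama.cpp; actual usage varies by a couple of GB.
def estimate_q4_vram_gb(params_billion: float, kv_cache_gb: float = 2.0) -> float:
    bytes_per_param = 0.6   # ~4.8 bits/weight for Q4_K_M-style quants
    overhead_gb = 0.5       # compute buffers, scratch space, etc.
    return params_billion * bytes_per_param + kv_cache_gb + overhead_gb

for name, params in [("Qwen 2.5 Coder 32B", 32.5), ("GPT-OSS 20B", 21.0), ("Llama 3.1 70B", 70.0)]:
    print(f"{name}: ~{estimate_q4_vram_gb(params):.1f} GB")
```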
App compatibility
| Feature | Supported |
|---|---|
| Local models | Yes |
| OpenRouter | No |
| OpenAI-compatible API | Yes |
| Ollama | No |
| LM Studio | No |
| Anthropic API | No |
| Google API | No |
| Mistral API | No |
| Docker | No |
| Works offline | Yes |
| Needs GPU | No |
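Because llama-server exposes an OpenAI-compatible endpoint (the API row in the table above), most editors and SDKs that speak the OpenAI API can point at it directly. A minimal sketch, assuming the official `openai` Python package and a llama-server instance on localhost:8080; the model name and API key are placeholders, since llama-server typically serves whichever model it loaded and does not require a key by default.

```python
# Minimal sketch: use llama-server's OpenAI-compatible API from Python.
# Assumes llama-server is already running on localhost:8080 with a model loaded.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server's OpenAI-compatible route
    api_key="sk-no-key-required",         # placeholder; no key is needed by default
)

reply = client.chat.completions.create(
    model="local-model",  # placeholder; the server serves the model it loaded
    messages=[{"role": "user", "content": "Write a one-line hello world in Python."}],
    max_tokens=64,
)
print(reply.choices[0].message.content)
```

The same base URL is what you would paste into Cline or any other OpenAI-compatible client.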
Recommended models
Local vs cloud: which should you use?
Use local models if
- You want privacy — data never leaves your machine
- You already have a GPU with sufficient VRAM
- You want zero per-token API costs
- You need offline access
Use cloud/API if
- Your GPU has insufficient VRAM for the models you need
- You want access to frontier model quality
- You need maximum coding/reasoning performance
- You don't want to manage local model downloads and updates
Setup overview
Setting up llama.cpp is moderate in complexity: you download or build the binaries, pull a GGUF model, and launch the server or CLI yourself. It runs on macOS, Linux, and Windows as a command-line tool.
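Once the server is up, a quick way to confirm it is reachable before wiring it into an editor is to hit its health endpoint. A minimal sketch, assuming llama-server's default port of 8080 and the `requests` package:

```python
# Quick sanity check that a local llama-server instance is reachable.
# Assumes the default port (8080); adjust if the server was started with --port.
import requests

try:
    r = requests.get("http://localhost:8080/health", timeout=5)
    print("llama-server status:", r.status_code, r.text.strip())
except requests.ConnectionError:
    print("llama-server is not reachable; is it running?")
```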
Limitations
- Not beginner-friendly (use Ollama or LM Studio instead)
- No built-in model management, so quick model swapping requires llama-swap
- No GUI (CLI tool; pair with a frontend)
Frequently asked questions
- What is llama.cpp?
- llama.cpp is the engine underneath Ollama, but faster, with full control over quants, context, and grammars. Grammar file support enables GPT-OSS tool calling in Cline. It is the open-source inference engine that powers Ollama, LM Studio, and most local LLM tools.
- Does llama.cpp need a GPU?
- llama.cpp itself does not require a GPU, but inference is far faster with one. VRAM requirements match Ollama for equivalent models, with roughly 10-20% faster inference, and grammar file support lets you run GPT-OSS-20B with Cline tool calling at 16 GB VRAM (MXFP4 quant).
- Can I run llama.cpp on CPU only?
- Yes — llama.cpp supports CPU-only operation, but performance will be significantly slower (5-10x) compared to GPU inference. CPU-only works best for models under 7B parameters with at least 16 GB of system RAM.
- What models work best with llama.cpp?
- Models that work well with llama.cpp include: Qwen3 32B, Qwen3 30B-A3B (MoE), Qwen 2.5 Coder 32B Instruct, GPT-OSS 20B, Llama 3.1 70B Instruct. The best model depends on your GPU's VRAM and your use case.
- Is llama.cpp free and open source?
- Yes. llama.cpp is open source and completely free. You can find the source code on GitHub at https://github.com/ggerganov/llama.cpp.