Ollama
The industry standard for running LLMs locally. Simple CLI, massive model library (100K+), OpenAI-compatible API on port 11434. Powers Open WebUI, Continue, and more.
Local LLM Tool, Developer Tool
Yes
No
Yes
Only for local models
Running LLMs locally as a backend for other apps
Easy
macOS, Linux, Windows, CLI, Docker
Open source — free
Ollama is the industry standard for running LLMs locally: a simple CLI, a massive model library (100K+), and an OpenAI-compatible API on port 11434. It powers Open WebUI, Continue, and many other apps, and is the most popular local LLM runtime with 120K+ GitHub stars.
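As a quick illustration, here is a minimal request to that OpenAI-compatible endpoint, assuming Ollama is running on the default port and `qwen2.5:7b` has already been pulled (the model name is just an example):

```sh
# Chat completion through Ollama's OpenAI-compatible API on the default port 11434.
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```

Any client that speaks the OpenAI API (Open WebUI, Continue, the official SDKs) can be pointed at `http://localhost:11434/v1` in the same way.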
Ollama runs entirely on your local hardware, so prompts and model outputs never leave your machine. It is open source (https://github.com/ollama/ollama), so you can inspect the code and self-host. A GPU is not strictly required: small models (3B-8B) run on CPU with enough system RAM, though 8 GB of VRAM is recommended for usable speeds with 7B models. Note that the default context window is only 2K tokens, so increase it for coding agents and long documents.
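One common way to raise the context window is to create a model variant with a larger `num_ctx` through a Modelfile; a sketch, where the 8192 value and the `qwen2.5-8k` name are arbitrary examples:

```sh
# Build a variant of an existing model with an 8K context window.
cat > Modelfile <<'EOF'
FROM qwen2.5:7b
PARAMETER num_ctx 8192
EOF
ollama create qwen2.5-8k -f Modelfile
ollama run qwen2.5-8k

# Alternatively, callers of the native /api/chat endpoint can pass
# "options": {"num_ctx": 8192} on a per-request basis.
```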
Can it run on my hardware?
Minimum
No GPU required — runs on CPU for small models (3B-8B) with sufficient system RAM. For 7B models, 8 GB VRAM recommended for usable speeds. Default context window is only 2K — increase it for coding agents.
Recommended
8 GB VRAM: 7B models at Q8. 12 GB VRAM: 13-14B at Q4. 16 GB VRAM: 14B at Q8 or MoE models. 24 GB VRAM (RTX 3090/4090): 27-32B at Q4 or 70B at Q2. 48 GB+ (dual GPU): 70B at Q4 or 235B MoE at IQ4.
Approximate VRAM needed for recommended local models at Q4 with 8K context:
| Model | Params | Q4 VRAM | Min GPU |
|---|---|---|---|
| Qwen3 32B | 32.8B | ~22.2 GB | 24 GB |
| Qwen3 14B | 14.8B | ~10.8 GB | 12 GB |
| Qwen 2.5 7B Instruct | 7.6B | ~5.3 GB | 8 GB |
| Llama 3.1 8B Instruct | 8B | ~6.3 GB | 8 GB |
| Gemma 3 12B Instruct | 12.2B | ~8.9 GB | 12 GB |
| Mistral Nemo 12B Instruct | 12.2B | ~9.2 GB | 12 GB |
| DeepSeek R1 Distill Qwen 32B | 32.5B | ~22.9 GB | 24 GB |
| Llama 3.1 70B Instruct | 70B | ~47.1 GB | 48 GB+ |
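As a rough cross-check, these figures work out to roughly 0.7 GB of VRAM per billion parameters at Q4 with an 8K context. A back-of-envelope sketch (the 0.7 factor is read off this table, not an exact formula; actual usage varies with the quant variant and context length):

```sh
# Rough Q4 VRAM estimate for a 32.8B-parameter model; the table lists ~22.2 GB.
echo "32.8 * 0.7" | bc   # prints 22.9
```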
App compatibility
| Feature | Supported |
|---|---|
| Local models | Yes |
| OpenRouter | No |
| OpenAI-compatible API | Yes |
| Ollama | Yes |
| LM Studio | No |
| Anthropic API | No |
| Google API | No |
| Mistral API | No |
| Docker | Yes |
| Works offline | Yes |
| Needs GPU | No |
Recommended models
Best local models
Qwen3 32B
32.8B params · ~22.2 GB at Q4 · Dense
Qwen3 14B
14.8B params · ~10.8 GB at Q4 · Dense
Qwen 2.5 7B Instruct
7.6B params · ~5.3 GB at Q4 · Dense
Llama 3.1 8B Instruct
8B params · ~6.3 GB at Q4 · Dense
Gemma 3 12B Instruct
12.2B params · ~8.9 GB at Q4 · Dense
Mistral Nemo 12B Instruct
12.2B params · ~9.2 GB at Q4 · Dense
DeepSeek R1 Distill Qwen 32B
32.5B params · ~22.9 GB at Q4 · Dense
Llama 3.1 70B Instruct
70B params · ~47.1 GB at Q4 · Dense
Local vs cloud: which should you use?
Use local models if
- You want privacy — data never leaves your machine
- You already have a GPU with sufficient VRAM
- You want zero per-token API costs
- You need offline access
Use cloud/API if
- Your GPU has insufficient VRAM for the models you need
- You want access to frontier model quality
- You need maximum coding/reasoning performance
- You don't want to manage local model downloads and updates
Setup overview
Setting up Ollama is straightforward. It runs on macOS, Linux, and Windows, as a CLI, and in Docker. Full documentation is available at https://github.com/ollama/ollama/tree/main/docs.
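A typical quick start on Linux looks something like the following (the install script is the one from the Ollama README; macOS and Windows users can instead download the installer from https://ollama.com; the model choice is just an example):

```sh
# Install Ollama (Linux one-liner from the official README).
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model and chat with it interactively.
ollama pull qwen2.5:7b
ollama run qwen2.5:7b

# Verify the local API is up and list the models you have downloaded.
curl http://localhost:11434/api/tags
```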
Limitations
Ollama is not the best fit for:
- GUI-only users (it is a CLI tool; pair it with Open WebUI for a graphical interface, see the sketch after this list)
- Maximum performance (raw llama.cpp is typically 10-20% faster)
- Cloud/API model access (local inference only)
- RAG or document Q&A on its own
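For the GUI pairing mentioned above, a common setup is Open WebUI in Docker talking to the local Ollama server. A sketch based on the command in the Open WebUI README at the time of writing (verify the image name and flags against their docs before use):

```sh
# Optionally expose Ollama beyond localhost so containers and other machines can reach it.
# (Illustrative; on Linux with systemd you would set OLLAMA_HOST in a service override instead.)
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Run Open WebUI and let it reach Ollama on the host.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
# Then browse to http://localhost:3000 and pick any model you have pulled with `ollama pull`.
```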
Frequently asked questions
- What is Ollama?
- Ollama is the industry standard for running LLMs locally: a simple CLI, a large model library, and an OpenAI-compatible API on port 11434. It powers Open WebUI, Continue, and many other apps, and is the most popular local LLM runtime with 120K+ GitHub stars.
- Does Ollama need a GPU?
- No. Ollama runs on CPU alone for small models (3B-8B) with sufficient system RAM, but a GPU makes inference much faster; 8 GB of VRAM is recommended for usable speeds with 7B models. Also note that the default context window is only 2K tokens, so increase it for coding agents.
- Can I run Ollama on CPU only?
- Yes — Ollama supports CPU-only operation, but performance will be significantly slower (5-10x) compared to GPU inference. CPU-only works best for models under 7B parameters with at least 16 GB of system RAM.
- How do I run a local model with Ollama?
- Install Ollama, pull your model (e.g., `ollama pull qwen2.5:7b`), then run it with `ollama run` or connect your app to the local server on port 11434. GPU requirements depend on the model you choose, not on Ollama itself.
- What models work best with Ollama?
- Models that work well with Ollama include: Qwen3 32B, Qwen3 14B, Qwen 2.5 7B Instruct, Llama 3.1 8B Instruct, Gemma 3 12B Instruct, Mistral Nemo 12B Instruct. The best model depends on your GPU's VRAM and your use case.
- Is Ollama free and open source?
- Yes. Ollama is open source and completely free. You can find the source code on GitHub at https://github.com/ollama/ollama.