vLLM
Production-grade LLM serving engine. PagedAttention for efficient KV cache, high throughput, multi-user API serving. For deployments, not single-user chat.
Local LLM Tool, Developer Tool
Yes
No
No
Yes — for local inference
Production LLM serving with high concurrency
Hard
Linux, Docker, CLI
Open source — free
vLLM is a production-grade LLM serving engine designed for high-throughput, multi-user deployments. Its PagedAttention algorithm manages the KV cache efficiently, which lets it keep throughput high while serving many concurrent requests through an API. It is built for production deployments, not single-user chat.
vLLM runs entirely on your local hardware, and it is open source (https://github.com/vllm-project/vllm), so you can inspect the code and self-host. An NVIDIA GPU with CUDA is required, and Linux is recommended. Model VRAM requirements are the same as for other backends, though vLLM's PagedAttention is more memory-efficient for long contexts and high concurrency.
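If you want to verify a local install before wiring up a server, vLLM also has an offline Python API. A minimal sketch, assuming vLLM is installed via `pip install vllm` and using a small instruct model as a placeholder; swap in any model that fits your VRAM:

```python
from vllm import LLM, SamplingParams

# Load a model from Hugging Face (placeholder name; pick one that fits your GPU).
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Sampling settings for a short completion.
params = SamplingParams(temperature=0.7, max_tokens=128)

# Generation runs entirely on local hardware; nothing leaves your machine.
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```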
Can it run on my hardware?
Minimum
An NVIDIA GPU with CUDA is required; Linux is recommended. Model VRAM requirements are the same as for other backends, though vLLM's PagedAttention is more memory-efficient for long contexts and high concurrency.
Recommended
An A100 40 GB or 80 GB for production deployments serving 70B models, or multiple RTX 3090/4090 GPUs for smaller-scale production. For single-user use this is overkill; use Ollama or LM Studio instead.
Approximate VRAM needed for recommended local models at Q4 with 8K context:
| Model | Params | Q4 VRAM | Min GPU |
|---|---|---|---|
| Qwen3 32B | 32.8B | ~22.2 GB | 24 GB |
| Llama 3.1 70B Instruct | 70B | ~47.1 GB | 48 GB+ |
| Qwen3 235B-A22B (MoE) | 235B | ~149.9 GB | Multi-GPU (≥150 GB total) |
| DeepSeek V3 671B | 671B | ~423.7 GB | Multi-GPU (≥424 GB total) |
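Models in this table that exceed a single card's VRAM can still be served by sharding their weights across several GPUs with vLLM's tensor parallelism. A hedged sketch, assuming a node with four GPUs; the model name, GPU count, and memory fraction are illustrative:

```python
from vllm import LLM, SamplingParams

# Shard the model weights across 4 GPUs. Total VRAM across the cards must
# still cover the weights plus KV cache; gpu_memory_utilization caps how
# much of each card vLLM will claim.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Summarize PagedAttention for a systems engineer."],
    SamplingParams(max_tokens=256),
)
print(outputs[0].outputs[0].text)
```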
App compatibility
| Feature | Supported |
|---|---|
| Local models | Yes |
| OpenRouter | No |
| OpenAI-compatible API | Yes |
| Ollama | No |
| LM Studio | No |
| Anthropic API | No |
| Google API | No |
| Mistral API | No |
| Docker | Yes |
| Works offline | Yes |
| Needs GPU | Yes |
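Because the server speaks the OpenAI-compatible API, any OpenAI SDK can talk to it by pointing the base URL at your machine. A sketch using the official `openai` Python package, assuming a vLLM server is already running on the default port 8000 and has loaded the model named below:

```python
from openai import OpenAI

# vLLM does not check the API key unless you start the server with --api-key,
# but the client still requires some string here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # must match the model the server loaded
    messages=[{"role": "user", "content": "What is PagedAttention?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```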
Local vs cloud: which should you use?
Use local models if
- You want privacy — data never leaves your machine
- You already have a GPU with sufficient VRAM
- You want zero per-token API costs
- You need offline access
- You have at least 16-24 GB VRAM for recommended models
Use cloud/API if
- Your GPU has insufficient VRAM for the models you need
- You want access to frontier model quality
- You need maximum coding/reasoning performance
- You don't want to manage local model downloads and updates
Setup overview
Setting up vLLM is complex and requires technical knowledge. It runs on Linux, via Docker, or from the CLI. Full documentation is available at https://docs.vllm.ai.
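The usual flow on a CUDA-equipped Linux machine is to install the vllm package and launch its OpenAI-compatible server with the `vllm serve` command (Docker users run the official image instead). A minimal sketch that wraps that CLI from Python; the model name and port are placeholders:

```python
import subprocess

# Start vLLM's OpenAI-compatible server as a child process.
# Equivalent to running `vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000`
# in a terminal after `pip install vllm`.
server = subprocess.Popen(
    ["vllm", "serve", "Qwen/Qwen2.5-7B-Instruct", "--port", "8000"]
)

try:
    server.wait()          # serve requests until interrupted
except KeyboardInterrupt:
    server.terminate()     # shut the server down cleanly on Ctrl+C
```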
Limitations
- Single-user interactive chat (overkill — use LM Studio or Ollama)
- Apple Silicon (difficult to set up)
- Beginners or quick experimentation
Frequently asked questions
- What is vLLM?
- vLLM is a production-grade LLM serving engine that uses PagedAttention for efficient KV cache management. It is designed for high-throughput, multi-user API serving in production deployments rather than single-user chat.
- Does vLLM need a GPU?
- Yes. vLLM requires an NVIDIA GPU with CUDA, and Linux is recommended. Model VRAM requirements are the same as for other backends, though vLLM's PagedAttention is more memory-efficient for long contexts and high concurrency.
- What models work best with vLLM?
- Models that work well with vLLM include: Qwen3 32B, Llama 3.1 70B Instruct, Qwen3 235B-A22B (MoE), DeepSeek V3 671B. The best model depends on your GPU's VRAM and your use case.
- Is vLLM free and open source?
- Yes. vLLM is open source and completely free. You can find the source code on GitHub at https://github.com/vllm-project/vllm.