How much VRAM does the NVIDIA RTX 5080 have?

The NVIDIA RTX 5080 has 16 GB of GDDR7 with 960 GB/s memory bandwidth.

What is the NVIDIA RTX 5080 best for?

With 16 GB of VRAM, the NVIDIA RTX 5080 handles smaller models (7B–14B) at Q4–Q5 quantization — ideal for entry-level local LLM experimentation and lightweight inference.

What LLMs can the NVIDIA RTX 5080 run locally?

The NVIDIA RTX 5080 can run 38 of the 84 open-weight models tracked by CanItRun natively in VRAM at 8k context. Top options include: Llama 3.1 8B Instruct at NVFP4, Llama 3.2 3B Instruct at BF16, Llama 3.2 1B Instruct at FP32.

Can the NVIDIA RTX 5080 run Llama 3.3 70B Instruct?

The NVIDIA RTX 5080 can run Llama 3.3 70B Instruct with CPU offload at Q3_K_M quantization, but inference will be slower than native VRAM execution.

Can the NVIDIA RTX 5080 run Qwen 3.6 27B?

Yes. The NVIDIA RTX 5080 runs Qwen 3.6 27B natively in VRAM at Q3_K_M quantization, achieving approximately 46.1 tokens per second.

Can the NVIDIA RTX 5080 run Llama 3.1 8B Instruct?

Yes. The NVIDIA RTX 5080 runs Llama 3.1 8B Instruct natively in VRAM at NVFP4 quantization, achieving approximately 123 tokens per second.

NVIDIA RTX 5080

The NVIDIA RTX 5080 has 16 GB VRAM and 960 GB/s memory bandwidth. It can run 38 of our 84 tracked models natively in VRAM at 8k context.

With 16 GB GDDR7, the NVIDIA RTX 5080 is a consumer-tier GPU that can run 38 models natively. It handles smaller models (7B–14B) at Q4–Q5 quantization.

The NVIDIA RTX 5080 is the second-tier Blackwell GPU, with 16GB GDDR7 on a 256-bit bus at 960 GB/s. Its 10,752 CUDA cores and 336 Tensor Cores make it a strong 1440p–4K gaming card, but the 16GB VRAM cap limits it to 7B–14B models for LLM inference without offloading.

NVIDIA RTX 5080: January 2025 Blackwell GB203 die with 16GB GDDR7 on a 256-bit bus at 960 GB/s — $999 MSRP, second-tier Blackwell consumer card.

7B-14B models fit natively at Q4-Q6. 27-32B models are tight or need CPU offload. ~12-18 t/s for 7B Q4.

Full CUDA support out of box. Same driver/toolkit requirements as the 5090; the 16GB cap is the binding constraint, not compute.

Vendor	NVIDIA
Architecture	Blackwell
VRAM	16 GB
Memory type	GDDR7
Memory bandwidth	960 GB/s
Compute backend	CUDA
Tier	Consumer
Released	2025
Models (native)	38 / 84
Models (offload)	13 / 84

Software: Full llama.cpp and Ollama support out of the box. CUDA 12.x recommended; driver ≥ 525 required.

Popular models for this GPU

Qwen 3.5 35B-A3B (MoE)Nemotron 3 Nano 30B Qwen3 30B-A3B (MoE)Gemma 2 27B Instruct Gemma 3 27B Instruct

Models this GPU runs natively in VRAM (38)

Show 33 more

Models that fit with CPU offload (13)

These use system RAM for layers that don't fit in VRAM — expect much slower inference.

Too large for this GPU (33)

Continue reading

hardware10 min

NVIDIA RTX 50 Series (Blackwell) for LLMs: Complete Guide

Frequently asked questions

How much VRAM does the NVIDIA RTX 5080 have?: The NVIDIA RTX 5080 has 16 GB of GDDR7 with 960 GB/s memory bandwidth.
What is the NVIDIA RTX 5080 best for?: With 16 GB of VRAM, the NVIDIA RTX 5080 handles smaller models (7B–14B) at Q4–Q5 quantization — ideal for entry-level local LLM experimentation and lightweight inference.
What LLMs can the NVIDIA RTX 5080 run locally?: The NVIDIA RTX 5080 can run 38 of the 84 open-weight models tracked by CanItRun natively in VRAM at 8k context. Top options include: Llama 3.1 8B Instruct at NVFP4, Llama 3.2 3B Instruct at BF16, Llama 3.2 1B Instruct at FP32.
Can the NVIDIA RTX 5080 run Llama 3.3 70B Instruct?: The NVIDIA RTX 5080 can run Llama 3.3 70B Instruct with CPU offload at Q3_K_M quantization, but inference will be slower than native VRAM execution.
Can the NVIDIA RTX 5080 run Qwen 3.6 27B?: Yes. The NVIDIA RTX 5080 runs Qwen 3.6 27B natively in VRAM at Q3_K_M quantization, achieving approximately 46.1 tokens per second.
Can the NVIDIA RTX 5080 run Llama 3.1 8B Instruct?: Yes. The NVIDIA RTX 5080 runs Llama 3.1 8B Instruct natively in VRAM at NVFP4 quantization, achieving approximately 123 tokens per second.