Phi-3.5 Mini Instruct
Phi-3.5 Mini Instruct needs roughly 6.0 GB VRAM at Q4_K_M quantization (12.1 GB at FP16). 105 GPUs we track can run it fully in VRAM at 8k context.
105 GPUs run this natively · 2 with CPU offload
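Phi-3.5 Mini Instruct is a 3.8B-parameter dense model developed by Microsoft and released in August 2024. It supports a 128K context window but does not use grouped-query attention (GQA), so its KV cache is larger than that of comparable GQA models.
To run Phi-3.5 Mini Instruct locally, use the recommended Q6_K quantization: about 3.1 GB of weights and 7.1 GB in total at 8k context, which fits on 8 GB GPUs. Note that the full set of KV heads increases memory use as context grows.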
It scores 35.6% on MMLU-Pro, which is strong for a sub-4B model, and 62.8% on HumanEval.
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 15.2 GB | 3.22 GB | 20.6 GB |
| BF16 | 7.6 GB | 3.22 GB | 12.1 GB |
| FP16 | 7.6 GB | 3.22 GB | 12.1 GB |
| Q8_0 | 3.8 GB | 3.22 GB | 7.9 GB |
| Q6_K (recommended) | 3.1 GB | 3.22 GB | 7.1 GB |
| Q5_K_M | 2.5 GB | 3.22 GB | 6.3 GB |
| Q4_K_M | 2.1 GB | 3.22 GB | 6.0 GB |
| Q3_K_M | 1.6 GB | 3.22 GB | 5.4 GB |
| Q2_K | 1.3 GB | 3.22 GB | 5.0 GB |
| NVFP4 (CUDA) | 1.9 GB | 3.22 GB | 5.7 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
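The Weights column can be approximated from the parameter count alone. The sketch below is not the calculator's actual code: it assumes the weights take `params * bits / 8` bytes, and the bits-per-weight values for Q6_K and Q4_K_M are nominal figures chosen here because they reproduce the table, not measured file sizes. The Total column is larger than weights plus KV cache in every row, so it appears to include some runtime overhead that this sketch does not model.

```python
PARAMS = 3.8e9  # parameter count stated on this page

# Bits per weight. The Q6_K and Q4_K_M values are assumptions picked to
# match the table above; real GGUF files can differ.
BITS_PER_WEIGHT = {
    "FP32": 32,
    "BF16": 16,
    "FP16": 16,
    "Q8_0": 8,
    "Q6_K": 6.5,
    "Q4_K_M": 4.5,
    "NVFP4": 4,
}


def weights_gb(params: float, bits: float) -> float:
    """Approximate weight size in decimal gigabytes."""
    return params * bits / 8 / 1e9


if __name__ == "__main__":
    for quant, bits in BITS_PER_WEIGHT.items():
        print(f"{quant:8s} {weights_gb(PARAMS, bits):5.1f} GB")
```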
Benchmarks
GPUs that run Phi-3.5 Mini Instruct natively (105)
- NVIDIA RTX 5090 · FP32 · 117.9 t/s
- NVIDIA RTX 5080 · BF16 · 126.3 t/s
- NVIDIA RTX 5070 Ti · BF16 · 117.9 t/s
- NVIDIA RTX 5070 · NVFP4 · 353.7 t/s
- NVIDIA RTX 5060 Ti 16GB · BF16 · 58.9 t/s
- NVIDIA RTX 5060 · NVFP4 · 235.8 t/s
- NVIDIA RTX 5050 · NVFP4 · 168.4 t/s
- NVIDIA RTX 4090 · FP32 · 66.3 t/s
- NVIDIA RTX 4080 · BF16 · 94.3 t/s
- NVIDIA RTX 4070 Ti · NVFP4 · 265.3 t/s
- NVIDIA RTX 4070 · NVFP4 · 265.3 t/s
- NVIDIA RTX 4060 Ti 16GB · BF16 · 37.9 t/s
- NVIDIA RTX 4060 · NVFP4 · 143.2 t/s
- NVIDIA RTX 3090 · FP32 · 61.6 t/s
- NVIDIA RTX 3090 Ti · FP32 · 66.3 t/s
- NVIDIA RTX 3080 10GB · NVFP4 · 400 t/s
- NVIDIA RTX 3060 12GB · NVFP4 · 189.5 t/s
- NVIDIA H100 80GB · FP32 · 220.4 t/s
- NVIDIA A100 80GB · FP32 · 134.1 t/s
- NVIDIA A100 40GB · FP32 · 102.3 t/s
- NVIDIA L40S · FP32 · 56.8 t/s
- NVIDIA RTX A6000 · FP32 · 50.5 t/s
- NVIDIA RTX 4000 Ada · BF16 · 42.1 t/s
- NVIDIA RTX 4500 Ada · FP32 · 28.4 t/s
- NVIDIA RTX 5000 Ada · FP32 · 37.9 t/s
- NVIDIA RTX 6000 Ada · FP32 · 63.2 t/s
- NVIDIA RTX Pro 6000 · FP32 · 88.4 t/s
- NVIDIA DGX Spark (128GB) · FP32 · 18 t/s
- AMD Radeon RX 7900 XTX · FP32 · 63.2 t/s
- AMD Radeon RX 7900 XT · BF16 · 105.3 t/s
- AMD Radeon RX 7900 GRE · BF16 · 75.8 t/s
- AMD Radeon RX 6800 XT · BF16 · 67.4 t/s
- AMD Radeon PRO W7800 · FP32 · 37.9 t/s
- AMD Radeon PRO W7900 · FP32 · 56.8 t/s
- AMD Instinct MI300X · FP32 · 348.7 t/s
- AMD Radeon AI Pro 9700 32GB · FP32 · 42.1 t/s
- AMD Strix Halo (128GB) · FP32 · 16.8 t/s
- AMD Strix Halo (96GB) · FP32 · 16.8 t/s
- AMD Strix Halo (64GB) · FP32 · 16.8 t/s
- Apple M5 Max (128GB) · FP32 · 40.4 t/s
- Apple M5 Max (64GB) · FP32 · 40.4 t/s
- Apple M5 Max (48GB) · FP32 · 40.4 t/s
- Apple M5 Pro (48GB) · FP32 · 20.2 t/s
- Apple M5 Pro (36GB) · FP32 · 20.2 t/s
- Apple M5 Pro (24GB) · BF16 · 40.4 t/s
- Apple M5 (32GB) · FP32 · 10.1 t/s
- Apple M5 (16GB) · Q8_0 · 40.3 t/s
- Apple M4 Ultra (384GB) · FP32 · 71.8 t/s
- Apple M4 Ultra (192GB) · FP32 · 71.8 t/s
- Apple M4 Max (128GB) · FP32 · 35.9 t/s
- Apple M4 Max (96GB) · FP32 · 35.9 t/s
- Apple M4 Max (64GB) · FP32 · 35.9 t/s
- Apple M4 Max (48GB) · FP32 · 35.9 t/s
- Apple M4 Pro (48GB) · FP32 · 18 t/s
- Apple M4 Pro (24GB) · BF16 · 35.9 t/s
- Apple M4 (32GB) · FP32 · 7.9 t/s
- Apple M4 (16GB) · Q8_0 · 31.6 t/s
- Apple M3 Ultra (512GB) · FP32 · 53.9 t/s
- Apple M3 Ultra (256GB) · FP32 · 53.9 t/s
- Apple M3 Ultra (96GB) · FP32 · 53.9 t/s
- Apple M3 Max (128GB) · FP32 · 26.3 t/s
- Apple M3 Max (96GB) · FP32 · 26.3 t/s
- Apple M3 Max (64GB) · FP32 · 26.3 t/s
- Apple M3 Max (48GB) · FP32 · 26.3 t/s
- Apple M3 Max (36GB) · FP32 · 26.3 t/s
- Apple M3 Pro (36GB) · FP32 · 9.9 t/s
- Apple M3 Pro (18GB) · BF16 · 19.7 t/s
- Apple M3 (24GB) · BF16 · 13.2 t/s
- Apple M3 (16GB) · Q8_0 · 26.3 t/s
- Apple M3 (8GB) · Q3_K_M · 61.2 t/s
- Apple M2 Ultra (384GB) · FP32 · 52.6 t/s
- Apple M2 Ultra (192GB) · FP32 · 52.6 t/s
- Apple M2 Max (96GB) · FP32 · 26.3 t/s
- Apple M2 Max (64GB) · FP32 · 26.3 t/s
- Apple M2 Max (32GB) · FP32 · 26.3 t/s
- Apple M2 Pro (32GB) · FP32 · 13.2 t/s
- Apple M2 Pro (16GB) · Q8_0 · 52.6 t/s
- Apple M2 (24GB) · BF16 · 13.2 t/s
- Apple M2 (16GB) · Q8_0 · 26.3 t/s
- Apple M2 (8GB) · Q3_K_M · 61.2 t/s
- Apple M1 Ultra (128GB) · FP32 · 52.6 t/s
- Apple M1 Ultra (64GB) · FP32 · 52.6 t/s
- Apple M1 Max (64GB) · FP32 · 26.3 t/s
- Apple M1 Max (32GB) · FP32 · 26.3 t/s
- Apple M1 Pro (32GB) · FP32 · 13.2 t/s
- Apple M1 Pro (16GB) · Q8_0 · 52.6 t/s
- Apple M1 (16GB) · Q8_0 · 17.9 t/s
- Apple M1 (8GB) · Q3_K_M · 41.6 t/s
- Intel Arc B580 12GB · Q8_0 · 120 t/s
- Intel Arc B570 10GB · Q8_0 · 100 t/s
- Intel Arc Pro B70 24GB · FP32 · 30 t/s
- Intel Arc Pro B60 24GB · FP32 · 25 t/s
- Intel Arc A770 16GB · BF16 · 73.7 t/s
- Intel Arc A770 8GB · Q6_K · 164.3 t/s
- Intel Arc A750 8GB · Q6_K · 164.3 t/s
- Intel Arc A580 8GB · Q6_K · 164.3 t/s
- Intel Arc A380 6GB · Q3_K_M · 113.8 t/s
- Intel Arc Pro A60 12GB · Q8_0 · 101.1 t/s
- Intel Arc Pro A50 6GB · Q3_K_M · 117.5 t/s
- Intel Arc Pro A40 6GB · Q3_K_M · 117.5 t/s
- Intel Data Center GPU Max 1550 · FP32 · 215.5 t/s
- Intel Data Center GPU Max 1100 · FP32 · 80.9 t/s
- Intel Arc 140V (32GB) · FP32 · 9 t/s
- Intel Arc 140V (16GB) · Q8_0 · 36.1 t/s
- Intel Arc 130V (16GB) · Q8_0 · 36.1 t/s
Plus 2 GPUs that run it with CPU offload (slower)
- Intel Arc A310 4GB · FP32 · 2 t/s
- CPU only (system RAM) · FP32 · 0.7 t/s
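Several of the tokens-per-second figures in the native list are consistent with a simple memory-bandwidth bound: each generated token reads all the weights once, so throughput is roughly bandwidth divided by weight size. This is an inference from the numbers, not something the page states, and the bandwidth values below are published spec-sheet figures assumed here rather than taken from this page. The CPU-offload entries are not modeled. A minimal sketch:

```python
# Weight size at FP32, from the VRAM table on this page.
FP32_WEIGHTS_GB = 15.2

# Assumed peak memory bandwidth in GB/s (vendor spec figures, not from this page).
ASSUMED_BANDWIDTH_GBPS = {
    "NVIDIA RTX 5090": 1792,
    "NVIDIA RTX 4090": 1008,
    "NVIDIA H100 80GB": 3350,
}


def bandwidth_bound_tps(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper-bound decode speed if every token streams all weights once."""
    return bandwidth_gbps / weights_gb


if __name__ == "__main__":
    for gpu, bw in ASSUMED_BANDWIDTH_GBPS.items():
        print(f"{gpu}: {bandwidth_bound_tps(bw, FP32_WEIGHTS_GB):.1f} t/s")
```

Under these assumptions the estimate explains why a smaller quantization on a slower card can outpace FP32 on a faster one: fewer bytes are read per token.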
Notes
No GQA: every attention head keeps its own keys and values, so the KV cache becomes large at long context.
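To make the cost concrete, the KV cache size can be computed from the model shape. The sketch below assumes 32 layers, 32 KV heads, and a head dimension of 96; those values come from the publicly released model configuration rather than from this page, so treat them as assumptions. The 8-KV-head variant is purely hypothetical and is included only to show what GQA would save.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """KV cache size: keys and values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Assumed model shape (not stated on this page).
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 96

full_8k = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, 8 * 1024)
full_128k = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, 128 * 1024)
# Hypothetical GQA variant with 8 KV heads, for comparison only.
gqa_8k = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, 8 * 1024)

if __name__ == "__main__":
    print(f"full KV heads, 8k context:   {full_8k / 1e9:.2f} GB")    # 3.22 GB
    print(f"full KV heads, 128k context: {full_128k / 1e9:.2f} GB")  # 51.54 GB
    print(f"8 KV heads (hypothetical):   {gqa_8k / 1e9:.2f} GB")     # 0.81 GB
```

The 8k result matches the 3.22 GB KV cache figure in the table (FP16, 2 bytes per value), and the growth is linear in context length.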
Frequently asked questions
- What are the VRAM requirements for Phi-3.5 Mini Instruct?
- Phi-3.5 Mini Instruct requires approximately 6.0 GB of VRAM at Q4_K_M quantization, 7.9 GB at Q8_0, and 12.1 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length, so longer contexts need more VRAM.
- How many parameters does Phi-3.5 Mini Instruct have?
- Phi-3.5 Mini Instruct has 3.8 billion parameters.
- How capable is Phi-3.5 Mini Instruct?
- Phi-3.5 Mini Instruct has an MMLU-Pro score of 47.4, making it well-suited for lightweight tasks, prototyping, and resource-constrained environments.
- Can Phi-3.5 Mini Instruct run on a 16 GB GPU?
- Yes. Phi-3.5 Mini Instruct needs 6.0 GB at Q4_K_M, which fits in a 16 GB GPU like the RTX 4080 or RTX 4070 Ti Super.
- What is the highest-precision format of Phi-3.5 Mini Instruct that fits in 24 GB of VRAM?
- FP32. At 8k context it needs 20.6 GB, so even the full-precision weights fit in 24 GB of VRAM without quantization.
- What GPU do I need to run Phi-3.5 Mini Instruct locally?
- An 8 GB GPU is enough for Q4_K_M, which needs 6.0 GB of VRAM at 8k context. A 16 GB GPU such as the RTX 4080 or RTX 4070 Ti Super also fits FP16 (12.1 GB).
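The "what fits" questions above all reduce to a lookup against the Total column of the VRAM table. A minimal sketch, using the 8k-context totals from this page; the function name and the `reserve_gb` headroom parameter are hypothetical, and real usable VRAM is usually somewhat less than the card's nominal capacity:

```python
from typing import Optional

# Total VRAM at 8k context from the table on this page, highest precision first.
TOTAL_GB_AT_8K = [
    ("FP32", 20.6),
    ("BF16", 12.1),
    ("Q8_0", 7.9),
    ("Q6_K", 7.1),
    ("Q5_K_M", 6.3),
    ("Q4_K_M", 6.0),
    ("Q3_K_M", 5.4),
    ("Q2_K", 5.0),
]


def best_quant(vram_gb: float, reserve_gb: float = 0.0) -> Optional[str]:
    """Return the highest-precision format whose total fits, or None."""
    budget = vram_gb - reserve_gb
    for name, total_gb in TOTAL_GB_AT_8K:
        if total_gb <= budget:
            return name
    return None


if __name__ == "__main__":
    for size in (24, 16, 8, 6, 4):
        print(f"{size} GB -> {best_quant(size)}")
```

Passing a nonzero `reserve_gb` models memory held back for the display or other processes; for example, `best_quant(8, reserve_gb=0.5)` selects Q6_K rather than Q8_0.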