Mixtral 8x22B Instruct v0.1
Mixtral 8x22B Instruct v0.1 needs roughly 91.0 GB VRAM at Q4_K_M quantization (317.9 GB at FP16). 29 GPUs we track can run it fully in VRAM at 8k context.
29 GPUs run this natively · 10 with CPU offload
Mixtral 8x22B Instruct v0.1 is a Mixture of Experts (MoE) model developed by Mistral AI and released in April 2024. It has 141B total parameters, of which only about 39B are active per token, and a 64K context window.
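To run Mixtral 8x22B Instruct v0.1 locally at Q4_K_M, plan for about 91 GB at 8k context (79.4 GB of weights plus KV cache and overhead). That is more than a single 80 GB GPU holds, so this is dual 48 GB, multi-GPU, or Mac Studio territory; an 80 GB card fits Q3_K_M instead. As a MoE model, inference speed depends on the active parameters (39B) rather than the total size.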
It is a significant upgrade over Mixtral 8x7B in reasoning, coding, and knowledge, scoring 40.0% on MMLU-Pro and 76.2% on HumanEval.
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 564.0 GB | 1.88 GB | 633.8 GB |
| BF16 | 282.0 GB | 1.88 GB | 317.9 GB |
| FP16 | 282.0 GB | 1.88 GB | 317.9 GB |
| Q8_0 | 141.0 GB | 1.88 GB | 160.0 GB |
| Q6_K | 115.6 GB | 1.88 GB | 131.6 GB |
| Q5_K_M | 90.8 GB | 1.88 GB | 103.8 GB |
| Q4_K_M (recommended) | 79.4 GB | 1.88 GB | 91.0 GB |
| Q3_K_M | 60.6 GB | 1.88 GB | 70.0 GB |
| Q2_K | 46.4 GB | 1.88 GB | 54.1 GB |
| NVFP4 (CUDA only) | 70.5 GB | 1.88 GB | 81.1 GB |
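KV cache is shown at 8k context (FP16). Each total is about 12% higher than weights plus KV cache, which is consistent with an allowance for runtime overhead. NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.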
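The table's arithmetic can be reproduced with a short sketch. Two things here are assumptions rather than statements from this page: the attention shape (56 layers, 8 KV heads, head dimension 128), which is taken from the published Mixtral 8x22B configuration and happens to reproduce the 1.88 GB figure, and the 1.12 overhead factor, which is inferred from the gap between the Total column and weights plus KV cache. The function names are illustrative only.

```python
# Assumed attention shape (not stated on this page), taken from the
# published Mixtral 8x22B configuration.
N_LAYERS = 56
N_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_VALUE = 2   # FP16 keys and values

# Inferred from the table: Total ~= (Weights + KV cache) * 1.12.
# This is an observation about the numbers, not a documented constant.
OVERHEAD_FACTOR = 1.12


def kv_cache_gb(context_tokens: int) -> float:
    """KV cache size in GB (10^9 bytes); grows linearly with context length."""
    # Factor of 2 covers one key and one value vector per KV head per layer.
    bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES_PER_VALUE
    return bytes_per_token * context_tokens / 1e9


def total_vram_gb(weights_gb: float, context_tokens: int = 8192) -> float:
    """Estimated total VRAM: weights plus KV cache, scaled by the overhead factor."""
    return (weights_gb + kv_cache_gb(context_tokens)) * OVERHEAD_FACTOR


if __name__ == "__main__":
    weights = {"Q8_0": 141.0, "Q4_K_M": 79.4, "Q2_K": 46.4}
    for quant, gb in weights.items():
        print(f"{quant}: {total_vram_gb(gb):.1f} GB at 8k context")
    print(f"KV cache at 64k context: {kv_cache_gb(65536):.2f} GB")
```

Under these assumptions, moving from 8k to the full 64K context multiplies the KV cache by eight, to roughly 15 GB, so long-context use can push a quantization that fits at 8k over the limit.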
Benchmarks
GPUs that run Mixtral 8x22B Instruct v0.1 natively (29)
- NVIDIA H100 80GB · Q3_K_M · 219.7 t/s
- NVIDIA A100 80GB · Q3_K_M · 133.7 t/s
- NVIDIA RTX Pro 6000 · NVFP4 · 75.8 t/s
- NVIDIA DGX Spark (128GB) · NVFP4 · 15.4 t/s
- AMD Instinct MI300X · Q8_0 · 149.5 t/s
- AMD Strix Halo (128GB) · Q5_K_M · 11.2 t/s
- AMD Strix Halo (96GB) · Q4_K_M · 12.8 t/s
- AMD Strix Halo (64GB) · Q2_K · 21.9 t/s
- Apple M5 Max (128GB) · Q5_K_M · 26.9 t/s
- Apple M5 Max (64GB) · Q2_K · 52.6 t/s
- Apple M4 Ultra (384GB) · BF16 · 15.4 t/s
- Apple M4 Ultra (192GB) · Q8_0 · 30.8 t/s
- Apple M4 Max (128GB) · Q5_K_M · 23.9 t/s
- Apple M4 Max (96GB) · Q4_K_M · 27.4 t/s
- Apple M4 Max (64GB) · Q2_K · 46.8 t/s
- Apple M3 Ultra (512GB) · BF16 · 11.6 t/s
- Apple M3 Ultra (256GB) · Q8_0 · 23.1 t/s
- Apple M3 Ultra (96GB) · Q4_K_M · 41 t/s
- Apple M3 Max (128GB) · Q5_K_M · 17.5 t/s
- Apple M3 Max (96GB) · Q4_K_M · 20 t/s
- Apple M3 Max (64GB) · Q2_K · 34.3 t/s
- Apple M2 Ultra (384GB) · BF16 · 11.3 t/s
- Apple M2 Ultra (192GB) · Q8_0 · 22.6 t/s
- Apple M2 Max (96GB) · Q4_K_M · 20 t/s
- Apple M2 Max (64GB) · Q2_K · 34.3 t/s
- Apple M1 Ultra (128GB) · Q5_K_M · 35 t/s
- Apple M1 Ultra (64GB) · Q2_K · 68.6 t/s
- Apple M1 Max (64GB) · Q2_K · 34.3 t/s
- Intel Data Center GPU Max 1550 · Q5_K_M · 143.5 t/s
Plus 10 GPUs that run it with CPU offload (slower)
- NVIDIA RTX 5090 · Q2_K · 34.9 t/s
- NVIDIA A100 40GB · Q2_K · 30.3 t/s
- NVIDIA L40S · Q3_K_M · 12.9 t/s
- NVIDIA RTX A6000 · Q3_K_M · 11.4 t/s
- NVIDIA RTX 5000 Ada · Q2_K · 11.2 t/s
- NVIDIA RTX 6000 Ada · Q3_K_M · 14.3 t/s
- AMD Radeon PRO W7800 · Q2_K · 11.2 t/s
- AMD Radeon PRO W7900 · Q3_K_M · 12.9 t/s
- AMD Radeon AI Pro 9700 32GB · Q2_K · 12.5 t/s
- Intel Data Center GPU Max 1100 · Q3_K_M · 18.3 t/s
Notes
MoE: 141B total / 39B active — needs a lot of VRAM but runs fast when it fits.
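Why a MoE model "runs fast when it fits" can be illustrated with a rough bandwidth-bound estimate. The sketch below assumes that decoding is limited by memory bandwidth, that each generated token reads every active weight once, and that the active weights are quantized at the same average size as the whole model. It ignores KV cache reads, batching, and kernel efficiency, so treat it as an idealized ceiling and not as the method behind the t/s figures on this page. The function name and the 1000 GB/s bandwidth are hypothetical.

```python
TOTAL_PARAMS_B = 141.0    # billions of parameters, total
ACTIVE_PARAMS_B = 39.0    # billions of parameters active per token


def decode_tokens_per_second(params_read_b: float,
                             bytes_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Idealized decode rate: bandwidth divided by bytes read per token."""
    gb_read_per_token = params_read_b * bytes_per_weight
    return bandwidth_gb_s / gb_read_per_token


if __name__ == "__main__":
    # Average bytes per weight at Q4_K_M, derived from the table (79.4 GB / 141B).
    q4_bytes_per_weight = 79.4 / TOTAL_PARAMS_B
    bandwidth = 1000.0  # hypothetical device bandwidth in GB/s

    moe = decode_tokens_per_second(ACTIVE_PARAMS_B, q4_bytes_per_weight, bandwidth)
    dense = decode_tokens_per_second(TOTAL_PARAMS_B, q4_bytes_per_weight, bandwidth)
    print(f"MoE (39B active): {moe:.1f} t/s ceiling")
    print(f"Dense 141B equivalent: {dense:.1f} t/s ceiling")
    print(f"Speedup from sparsity: {moe / dense:.2f}x")
```

In this model the speedup over a dense network of the same size is the ratio of total to active parameters, about 3.6x, regardless of the device or quantization chosen.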
Frequently asked questions
- What are the VRAM requirements for Mixtral 8x22B Instruct v0.1?
- Mixtral 8x22B Instruct v0.1 requires approximately 91.0 GB of VRAM at Q4_K_M quantization, 160.0 GB at Q8_0, and 317.9 GB at FP16. These numbers assume an 8k context window; the KV cache portion grows linearly with context length.
- How many parameters does Mixtral 8x22B Instruct v0.1 have?
- Mixtral 8x22B Instruct v0.1 has 141 billion total parameters, but only 39 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests.
- How capable is Mixtral 8x22B Instruct v0.1?
- Mixtral 8x22B Instruct v0.1 scores 40.0% on MMLU-Pro and 76.2% on HumanEval, a significant upgrade over Mixtral 8x7B in reasoning, coding, and knowledge. With a footprint of 91.0 GB at Q4_K_M, it is not a fit for resource-constrained environments.
- Can Mixtral 8x22B Instruct v0.1 run on a 16 GB GPU?
- No. At Q4_K_M, Mixtral 8x22B Instruct v0.1 needs 91.0 GB of VRAM, far more than 16 GB. You will need a multi-GPU server or a high-memory unified-memory machine such as a Mac Studio.
- Can Mixtral 8x22B Instruct v0.1 run on a 24 GB GPU?
- No. At Q4_K_M, Mixtral 8x22B Instruct v0.1 needs 91.0 GB, and even Q2_K needs 54.1 GB. Consider a multi-GPU server with more than 91 GB of total VRAM.
- What is the smallest quantization for Mixtral 8x22B Instruct v0.1 that fits in 24 GB of VRAM?
- Mixtral 8x22B Instruct v0.1 cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 54.1 GB at Q2_K.
- What GPU do I need to run Mixtral 8x22B Instruct v0.1 locally?
- At Q4_K_M, Mixtral 8x22B Instruct v0.1 needs 91.0 GB of VRAM, more than any single consumer GPU offers. Consider 2–4× H100 or A100 GPUs, or a high-memory workstation or unified-memory device from the list above.
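The fit questions above all reduce to a lookup against the Total column of the quantization table. A minimal sketch, assuming the 8k-context totals from this page and a hypothetical helper name; NVFP4 is left out because it only applies to CUDA GPUs:

```python
from typing import Optional

# Total VRAM in GB at 8k context, copied from the quantization table.
TOTAL_GB_AT_8K = {
    "FP16": 317.9,
    "Q8_0": 160.0,
    "Q6_K": 131.6,
    "Q5_K_M": 103.8,
    "Q4_K_M": 91.0,
    "Q3_K_M": 70.0,
    "Q2_K": 54.1,
}


def largest_quant_that_fits(vram_gb: float) -> Optional[str]:
    """Return the highest-precision quantization whose total fits, or None."""
    fitting = {q: gb for q, gb in TOTAL_GB_AT_8K.items() if gb <= vram_gb}
    if not fitting:
        return None
    # The largest footprint that still fits is the highest-precision option.
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for vram in (16, 24, 64, 80, 96, 192):
        print(f"{vram} GB -> {largest_quant_that_fits(vram)}")
```

For multi-GPU setups, pass the combined memory (for example 96 for dual 48 GB cards); this ignores per-device overhead and any memory reserved by the operating system on unified-memory machines.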