CanItRun Logocanitrun.

Mixtral 8x7B Instruct v0.1

Mixtral 8x7B Instruct v0.1 needs roughly 30.6 GB VRAM at Q4_K_M quantization (105.8 GB at FP16). 67 GPUs we track can run it fully in VRAM at 8k context.

67 GPUs run this natively · 27 with CPU offload

Mistral AI46.7B params12.9B active (MoE)32k contextApache 2.0Commercial use ok

Mixtral 8x7B Instruct v0.1 is a Mixture of Experts (MoE) model with 46.7B total parameters but only 12.9B active per token developed by Mistral AI. December 2023 MoE pioneer — 46.7B total parameters with ~13B active per token via 8 experts and top-2 routing.

To run Mixtral 8x7B Instruct v0.1 locally: Memory costs proportional to 47B (not 13B active). Q4_K_M ~28-32GB — fits on 24GB GPU with tight context or 32GB+ recommended. Community papers show successful mixed quantization on desktop hardware. As a MoE model, inference speed depends on active parameters (12.9B) rather than total size.

Outperforms Llama-2-70B and GPT-3.5 on most benchmarks. MMLU 70.6%, GSM8K 74.4%. MT Bench 8.30 beats GPT-3.5 Turbo.

VRAM at each quantization

Assumes 8k context. KV cache grows linearly with context length.

QuantWeightsKV cacheTotal
FP32186.8 GB1.07 GB210.4 GB
BF1693.4 GB1.07 GB105.8 GB
FP1693.4 GB1.07 GB105.8 GB
Q8_046.7 GB1.07 GB53.5 GB
Q6_K38.3 GB1.07 GB44.1 GB
Q5_K_M30.1 GB1.07 GB34.9 GB
Q4_K_Mrec26.3 GB1.07 GB30.6 GB
Q3_K_M20.1 GB1.07 GB23.7 GB
Q2_K15.4 GB1.07 GB18.4 GB
NVFP4cuda23.4 GB1.07 GB27.4 GB

KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.

Benchmarks

GPUs that run Mixtral 8x7B Instruct v0.1 natively (67)

Plus 27 GPUs that run it with CPU offload (slower)

Notes

MoE: 47B total params, only ~13B active per token — fast if it fits.

Hugging Face ↗Ollama ↗Released 2023-12-11

Compare Mixtral 8x7B Instruct v0.1 with other models

Frequently asked questions

What are the VRAM requirements for Mixtral 8x7B Instruct v0.1?
Mixtral 8x7B Instruct v0.1 requires approximately 30.6 GB of VRAM at Q4_K_M quantization, 53.5 GB at Q8, and 105.8 GB at FP16. These numbers assume 8k context window; VRAM scales linearly with context length due to the KV cache.
How many parameters does Mixtral 8x7B Instruct v0.1 have?
Mixtral 8x7B Instruct v0.1 has 46.7 billion total parameters, but only 12.9 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests.
How capable is Mixtral 8x7B Instruct v0.1?
Mixtral 8x7B Instruct v0.1 has an MMLU-Pro score of 29.7, making it well-suited for lightweight tasks, prototyping, and resource-constrained environments.
Can Mixtral 8x7B Instruct v0.1 run on a 16 GB GPU?
No. At Q4_K_M, Mixtral 8x7B Instruct v0.1 needs 30.6 GB of VRAM — more than 16 GB. You will need a 48 GB GPU like the RTX 6000 Ada or a dual-GPU setup.
Can Mixtral 8x7B Instruct v0.1 run on a 24 GB GPU?
No. Even at Q4_K_M, Mixtral 8x7B Instruct v0.1 needs 30.6 GB. Consider a 48 GB card like the RTX 6000 Ada or a dual RTX 4090 setup.
What is the smallest quantization for Mixtral 8x7B Instruct v0.1 that fits in 24 GB of VRAM?
At Q3_K_M, Mixtral 8x7B Instruct v0.1 needs 23.7 GB — the highest-quality quantization that fits in 24 GB of VRAM.
What GPU do I need to run Mixtral 8x7B Instruct v0.1 locally?
You need a 48 GB GPU or a dual-GPU setup. At Q4_K_M, Mixtral 8x7B Instruct v0.1 needs 30.6 GB VRAM. Options: RTX 6000 Ada (48 GB), A6000 (48 GB), or 2× RTX 4090.