GLM-5 744B
GLM-5 744B needs roughly 480.9 GB VRAM at Q4_K_M quantization (1678.3 GB at FP16). 3 GPUs we track can run it fully in VRAM at 8k context.
3 GPUs run this natively · 0 with CPU offload
GLM-5 744B is a Mixture of Experts (MoE) model developed by Z.ai and released in February 2026. It has 744B total parameters, of which only 40B are active per token, and uses DeepSeek Sparse Attention (DSA) for efficient long-context inference.
Running GLM-5 744B locally requires datacenter-scale hardware: even Q2_K needs roughly 250–300 GB of memory. Because it is an MoE model, inference speed depends on the active parameters (40B) rather than the total size.
It scores 86.0% on GPQA and routes each token to 8 of its 256 routed experts.
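The point about active parameters can be made concrete with a little arithmetic. The sketch below assumes that the share of weight bytes touched per token matches the share of active parameters (40B of 744B) and that decoding is limited by memory bandwidth. Both are simplifying assumptions, not claims from this page, and the function names and the 800 GB/s bandwidth figure are hypothetical.

```python
# Rough sketch: how much of the model is touched per token in an MoE.
# ASSUMPTIONS (not from the source): weight bytes scale with parameter count,
# and decode speed is bounded by memory bandwidth / bytes read per token.

TOTAL_PARAMS_B = 744.0   # total parameters, in billions (from the text)
ACTIVE_PARAMS_B = 40.0   # parameters active per token, in billions (from the text)


def active_fraction() -> float:
    """Share of parameters used for a single token."""
    return ACTIVE_PARAMS_B / TOTAL_PARAMS_B


def active_weight_gb(weights_gb: float) -> float:
    """Approximate weight bytes read per token, given the full weight size."""
    return weights_gb * active_fraction()


def rough_decode_tokens_per_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper-bound style estimate: bandwidth divided by bytes read per token."""
    return bandwidth_gb_s / active_weight_gb(weights_gb)


if __name__ == "__main__":
    q4_weights_gb = 418.9            # Q4_K_M weights, from the table on this page
    hypothetical_bandwidth = 800.0   # GB/s, illustrative value only
    print(f"active fraction: {active_fraction():.1%}")
    print(f"weights read per token: {active_weight_gb(q4_weights_gb):.1f} GB")
    print(f"rough decode rate: "
          f"{rough_decode_tokens_per_s(q4_weights_gb, hypothetical_bandwidth):.0f} t/s")
```

Note that the full weight set must still be resident in memory; only the per-token read volume shrinks, which is why memory capacity and speed are governed by different numbers.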
VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP32 | 2976.0 GB | 10.47 GB | 3344.8 GB |
| BF16 | 1488.0 GB | 10.47 GB | 1678.3 GB |
| FP16 | 1488.0 GB | 10.47 GB | 1678.3 GB |
| Q8_0 | 744.0 GB | 10.47 GB | 845.0 GB |
| Q6_K | 610.1 GB | 10.47 GB | 695.0 GB |
| Q5_K_M | 479.1 GB | 10.47 GB | 548.4 GB |
| Q4_K_M | 418.9 GB | 10.47 GB | 480.9 GB |
| Q3_K_M | 319.9 GB | 10.47 GB | 370.0 GB |
| Q2_K (recommended) | 244.8 GB | 10.47 GB | 285.9 GB |
| NVFP4 (CUDA) | 372.0 GB | 10.47 GB | 428.4 GB |
KV cache shown at 8k context (FP16). NVFP4 requires a CUDA GPU. Enable TurboQuant in the calculator to see reduced KV cache estimates.
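The Total column is larger than weights plus KV cache. The figures in the table are consistent with a flat overhead of about 12% applied to that sum, although the page does not state this, so the factor below is an inference from the numbers. The sketch also scales the KV cache linearly with context length, treating "8k" as 8192 tokens (another assumption). The function names are illustrative.

```python
# Sketch: reproduce the table's Total column and scale the KV cache with context.
# ASSUMPTION: the 1.12 overhead factor is inferred from the table, not documented.
# ASSUMPTION: "8k context" is taken to mean 8192 tokens.

OVERHEAD_FACTOR = 1.12
KV_CACHE_GB_AT_8K = 10.47   # FP16 KV cache at 8k context, from the table
BASE_CONTEXT_TOKENS = 8192

# Weight sizes in GB, copied from the table.
WEIGHTS_GB = {
    "FP32": 2976.0, "BF16": 1488.0, "FP16": 1488.0, "Q8_0": 744.0,
    "Q6_K": 610.1, "Q5_K_M": 479.1, "Q4_K_M": 418.9, "Q3_K_M": 319.9,
    "Q2_K": 244.8, "NVFP4": 372.0,
}


def kv_cache_gb(context_tokens: int) -> float:
    """KV cache size, scaled linearly from the 8k figure."""
    return KV_CACHE_GB_AT_8K * context_tokens / BASE_CONTEXT_TOKENS


def total_vram_gb(quant: str, context_tokens: int = BASE_CONTEXT_TOKENS) -> float:
    """Estimated total memory: (weights + KV cache) times the inferred overhead."""
    return (WEIGHTS_GB[quant] + kv_cache_gb(context_tokens)) * OVERHEAD_FACTOR


if __name__ == "__main__":
    for quant in WEIGHTS_GB:
        print(f"{quant:8s} {total_vram_gb(quant):8.1f} GB at 8k, "
              f"{total_vram_gb(quant, 32768):8.1f} GB at 32k")
```

Because the weights dominate at this model size, even a fourfold increase in context adds only a few tens of gigabytes to the total.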
Benchmarks
GPUs that run GLM-5 744B natively (3)
- Apple M4 Ultra (384GB) · Q3_K_M · 69.8 t/s
- Apple M3 Ultra (512GB) · Q4_K_M · 40 t/s
- Apple M2 Ultra (384GB) · Q3_K_M · 51.2 t/s
Notes
Uses DeepSeek Sparse Attention (DSA) for efficient long-context inference. 256 routed experts, 8 active per token.
Frequently asked questions
- What are the VRAM requirements for GLM-5 744B?
- GLM-5 744B requires approximately 480.9 GB of VRAM at Q4_K_M quantization, 845.0 GB at Q8_0, and 1678.3 GB at FP16. These numbers assume an 8k context window; the KV cache portion of VRAM grows linearly with context length.
- How many parameters does GLM-5 744B have?
- GLM-5 744B has 744 billion total parameters, but only 40 billion are active per token thanks to its Mixture of Experts (MoE) architecture. This makes inference significantly faster than the total parameter count suggests.
- How capable is GLM-5 744B?
- GLM-5 744B achieves an MMLU-Pro score of 85.7, placing it among the most capable open-weight models available — competitive with frontier systems on general knowledge and reasoning.
- Can GLM-5 744B run on a 16 GB GPU?
- No. At Q4_K_M, GLM-5 744B needs 480.9 GB of VRAM — more than 16 GB. You will need a multi-GPU server.
- Can GLM-5 744B run on a 24 GB GPU?
- No. Even at Q4_K_M, GLM-5 744B needs 480.9 GB. Consider a multi-GPU server with at least 480.9 GB of total VRAM.
- What is the smallest quantization for GLM-5 744B that fits in 24 GB of VRAM?
- GLM-5 744B cannot fit in 24 GB of VRAM at any standard quantization level. The minimum needed is 285.9 GB at Q2_K.
- What GPU do I need to run GLM-5 744B locally?
- You need a multi-GPU server. At Q4_K_M, GLM-5 744B needs 480.9 GB VRAM, more than any single consumer GPU. With 80 GB H100 or A100 GPUs, that means at least seven cards (7 × 80 GB = 560 GB).
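The fit questions above all reduce to comparing a memory budget against the Total column. A minimal sketch, using the totals from the table at 8k context; the helper names are hypothetical, and the GPU count ignores any per-device overhead from sharding, so treat it as a lower bound.

```python
import math

# Total memory in GB at 8k context, copied from the table on this page.
TOTAL_GB = {
    "FP32": 3344.8, "BF16": 1678.3, "FP16": 1678.3, "Q8_0": 845.0,
    "Q6_K": 695.0, "Q5_K_M": 548.4, "Q4_K_M": 480.9, "Q3_K_M": 370.0,
    "Q2_K": 285.9, "NVFP4": 428.4,
}


def quants_that_fit(budget_gb: float) -> list[str]:
    """Quantizations whose total footprint fits in the given memory budget."""
    return [quant for quant, total in TOTAL_GB.items() if total <= budget_gb]


def gpus_needed(quant: str, per_gpu_gb: float) -> int:
    """Lower bound on GPU count; ignores per-device sharding overhead."""
    return math.ceil(TOTAL_GB[quant] / per_gpu_gb)


if __name__ == "__main__":
    for budget in (16, 24, 384, 512):
        print(f"{budget} GB budget: {quants_that_fit(budget) or 'nothing fits'}")
    print("80 GB GPUs for Q4_K_M:", gpus_needed("Q4_K_M", 80))
```

An empty result for 16 GB and 24 GB matches the answers above: no standard quantization of this model fits on a single consumer GPU.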