# GLM-4.7 358B
GLM-4.7 358B needs roughly 203.9 GB of VRAM at Q4_K_M quantization (805.4 GB at FP16). Of the GPUs we track, 10 can run it fully in VRAM at 8k context.
Z.ai · 358B params · 32B active (MoE) · 198k context · MIT license · commercial use OK
## VRAM at each quantization
Assumes 8k context. KV cache grows linearly with context length. Totals exceed weights plus KV cache because they also include a runtime overhead of roughly 12% of the weight size.
| Quant | Weights | KV cache | Total |
|---|---|---|---|
| FP16 | 716.0 GB | 3.09 GB | 805.4 GB |
| Q8 | 358.0 GB | 3.09 GB | 404.4 GB |
| Q6_K | 268.5 GB | 3.09 GB | 304.2 GB |
| Q5_K_M | 223.8 GB | 3.09 GB | 254.1 GB |
| Q4_K_M | 179.0 GB | 3.09 GB | 203.9 GB |
| Q3_K_M | 143.2 GB | 3.09 GB | 163.8 GB |
| Q2_K | 107.4 GB | 3.09 GB | 123.8 GB |
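The table can be reproduced approximately: weight size is parameter count times bits-per-weight, KV cache scales linearly with context, and totals add a runtime overhead. A minimal sketch, assuming typical llama.cpp-style bits-per-weight averages and a flat ~12% overhead (both assumptions are mine, inferred from the table, not stated by the source):

```python
# Rough VRAM estimate for GLM-4.7 358B, reproducing the table above.
# Bits-per-weight values and the 12% overhead factor are assumptions
# fitted to the table, not official figures.

PARAMS = 358e9          # total parameters (MoE: all experts reside in VRAM)
KV_CACHE_GB = 3.09      # KV cache at 8k context (from the table)
OVERHEAD = 0.12         # assumed runtime overhead, as a fraction of weight size

BITS_PER_WEIGHT = {
    "FP16": 16.0, "Q8": 8.0, "Q6_K": 6.0, "Q5_K_M": 5.0,
    "Q4_K_M": 4.0, "Q3_K_M": 3.2, "Q2_K": 2.4,
}

def weights_gb(quant: str) -> float:
    """Weight size in (decimal) GB for a given quantization."""
    return PARAMS * BITS_PER_WEIGHT[quant] / 8 / 1e9

def kv_gb(context_tokens: int) -> float:
    """KV cache grows linearly with context: 3.09 GB at 8k."""
    return KV_CACHE_GB * context_tokens / 8192

def total_gb(quant: str, context_tokens: int = 8192) -> float:
    """Approximate total VRAM: weights + overhead + KV cache."""
    w = weights_gb(quant)
    return w * (1 + OVERHEAD) + kv_gb(context_tokens)

for q in BITS_PER_WEIGHT:
    print(f"{q:7s} weights {weights_gb(q):7.1f} GB  total ~{total_gb(q):6.1f} GB")
```

For example, `weights_gb("Q4_K_M")` gives 179.0 GB, and `total_gb("Q4_K_M")` lands within about half a GB of the table's 203.9 GB figure.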
## GPUs that run GLM-4.7 358B natively (10)
- NVIDIA DGX Spark (128GB) · Q2_K · 31.3 t/s
- AMD Instinct MI300X · Q3_K_M · 455.5 t/s
- AMD Strix Halo (128GB) · Q2_K · 29.3 t/s
- Apple M4 Ultra (384GB) · Q6_K · 50.1 t/s
- Apple M4 Ultra (192GB) · Q3_K_M · 93.8 t/s
- Apple M4 Max (128GB) · Q2_K · 62.6 t/s
- Apple M3 Max (128GB) · Q2_K · 45.8 t/s
- Apple M2 Ultra (384GB) · Q6_K · 36.7 t/s
- Apple M2 Ultra (192GB) · Q3_K_M · 68.8 t/s
- Apple M1 Ultra (128GB) · Q2_K · 91.7 t/s
## Notes
Flagship 4.x model with interleaved/preserved/turn-level thinking modes. 200K context.