

Zhipu Reveals Why GLM-5 Got Dumber Under Heavy Load

Zhipu published a long technical post explaining why GLM-5 produced garbled output, repetitive generation, and rare-character errors at a scale of "hundreds of millions of daily calls." Engineers traced the failures to two root causes: cross-request KV-cache races in their prefill/decode-separated serving architecture, and async-loading timing bugs in HiCache. After the fixes, anomaly rates dropped from ~0.1% to ~0.03%. They also introduced LayerSplit, which shards the KV cache across GPUs and broadcasts the shards before attention is computed, yielding 10% to 132% throughput gains on 40k-120k-token contexts. Takeaway: when an LLM seems to "get dumber" under load, the cause is often inference infrastructure amplifying race conditions, not the model itself.
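
To make the first failure mode concrete, here is a minimal single-process sketch of the kind of cross-request KV-cache race the post describes: two requests sharing a cached prefix block, where freeing without reference counting lets one request recycle a block another is still reading. All names (`KVBlockPool`, `acquire`, `release`) are illustrative; the post's actual code is not public.

```python
import threading

class KVBlockPool:
    """Toy prefix-cache block pool shared across concurrent requests.

    Without the refcount, a request finishing early would return a
    block to the free list while another request is still attending
    over it; the block then gets reallocated and overwritten
    mid-decode, surfacing as garbled or repetitive output.
    """
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks
        self.lock = threading.Lock()

    def acquire(self, shared_block=None):
        with self.lock:
            if shared_block is not None and self.refcount[shared_block] > 0:
                # Prefix-cache hit: share the live block instead of recomputing.
                self.refcount[shared_block] += 1
                return shared_block
            block = self.free.pop()
            self.refcount[block] = 1
            return block

    def release(self, block):
        with self.lock:
            self.refcount[block] -= 1
            # Recycle only once every reader is done; freeing
            # unconditionally is the cross-request race.
            if self.refcount[block] == 0:
                self.free.append(block)
```

And here is a scaled-down stand-in for the LayerSplit idea as summarized above: the KV cache for a long context is sharded across GPUs, and the shards are gathered back into one tensor before the attention matmul. This is a single-process NumPy simulation under assumed details (sharding along the sequence dimension, a plain gather), not LayerSplit's real API.

```python
import numpy as np

def attention(q, k, v):
    # Single decode step: q is (heads, d), k and v are (heads, seq, d).
    scores = np.einsum("hd,hsd->hs", q, k) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return np.einsum("hs,hsd->hd", weights, v)

heads, d, num_gpus = 8, 64, 4
seq = 120  # stand-in for a 120k-token context, scaled down 1000x
rng = np.random.default_rng(0)
k_shards = [rng.standard_normal((heads, seq // num_gpus, d)) for _ in range(num_gpus)]
v_shards = [rng.standard_normal((heads, seq // num_gpus, d)) for _ in range(num_gpus)]

# "Broadcast before attention compute": reassemble the sequence shards
# into one contiguous KV tensor on the device running attention.
k_full = np.concatenate(k_shards, axis=1)
v_full = np.concatenate(v_shards, axis=1)

out = attention(rng.standard_normal((heads, d)), k_full, v_full)
print(out.shape)  # (8, 64)
```

The likely payoff is memory headroom: a 120k-token KV cache may not fit comfortably on one device, which would explain why the reported 10% to 132% gains are specific to 40k-120k-token contexts.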

Published: 2026-05-03


Tags

zhipu · glm-5 · inference · kv-cache · infrastructure
