TODAY
Zhipu Reveals Why GLM-5 Got Dumber Under Heavy Load
Zhipu published a long technical post explaining why GLM-5 produced garbled output, repetitive generation, and rare-character errors at a scale of "hundreds of millions of daily calls." Engineers traced the problems to two issues: cross-request KV-cache races in their prefill/decode-separated serving architecture, and async-loading timing bugs in HiCache. After the fixes, anomaly rates dropped from ~0.1% to ~0.03%. They also introduced LayerSplit, which shards the KV cache across GPUs and broadcasts it before attention compute, yielding 10%-132% throughput gains on 40k-120k-token contexts. Takeaway: when an LLM seems to "get dumber" under load, the culprit is often inference infrastructure amplifying race conditions, not the model itself.
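The shard-then-broadcast idea behind LayerSplit can be sketched roughly as follows. This is a minimal CPU simulation with numpy, not Zhipu's actual implementation (which is not public): each "device" holds only a slice of one layer's KV cache, and an all-gather reassembles the full cache right before attention compute. All names and shapes here are illustrative assumptions.

```python
import numpy as np

# Hypothetical LayerSplit-style sketch: shard a layer's KV cache across
# NUM_DEVICES "GPUs" (simulated as numpy slices), then all-gather before
# attention. Memory per device drops ~NUM_DEVICES-fold at the cost of a
# broadcast step.
NUM_DEVICES = 4
SEQ_LEN, NUM_HEADS, HEAD_DIM = 128, 8, 64

rng = np.random.default_rng(0)
# Full KV cache for one layer: (seq_len, num_heads, head_dim)
kv_full = rng.standard_normal((SEQ_LEN, NUM_HEADS, HEAD_DIM)).astype(np.float32)

# Shard along the sequence axis: each device stores only its slice.
shards = np.array_split(kv_full, NUM_DEVICES, axis=0)

def all_gather(shards):
    """Reassemble the full KV cache on every device just before
    attention compute (the communication LayerSplit pays for in
    exchange for the per-device memory savings)."""
    return np.concatenate(shards, axis=0)

kv_gathered = all_gather(shards)
assert np.array_equal(kv_gathered, kv_full)
print(kv_gathered.shape)  # (128, 8, 64)
```

In a real multi-GPU setup the concatenate would be a collective (e.g. NCCL all-gather), and the win reported in the post comes from long contexts (40k-120k tokens), where KV memory, not compute, is the bottleneck.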
Sources
- 量子位 (QbitAI): "Zhipu reveals the secret behind the 'dumbing down'" (zh-CN)
Tags