
Metric

SuperCLUE

A comprehensive Chinese LLM benchmark suite covering reasoning, knowledge, language, code, and safety, published as a regularly updated leaderboard.

SuperCLUE is a comprehensive Chinese LLM benchmark suite from CLUE (the Chinese Language Understanding Evaluation team). Unlike single-shot benchmarks, SuperCLUE is a battery of tests across reasoning, broad knowledge, language understanding, code generation, agent abilities, long-context handling, and safety. The team publishes a regularly updated leaderboard ranking both Chinese and Western models. It matters because the Chinese AI ecosystem needed a more rigorous, more frequently refreshed benchmark than C-Eval and CMMLU, which had saturated.

SuperCLUE has multiple sub-tracks (SuperCLUE-Math, SuperCLUE-Agent, SuperCLUE-Code, SuperCLUE-Safety, SuperCLUE-Long) that probe specific capabilities, and the leaderboard refreshes as new models launch. For Chinese-market product teams, SuperCLUE rank is one of the most credible public signals of model quality.

A concrete example: SuperCLUE-Long tests how models handle Chinese documents of 100k+ tokens, including needle-in-a-haystack retrieval, multi-document reasoning, and summarization at length. SuperCLUE-Agent tests tool use, planning, and multi-step task completion in Chinese.

Limitations: like all benchmarks, scores can be gamed if models were trained on the test sets, and the team has had to refresh question banks several times to combat contamination. Treat the leaderboard as one signal among many, and pair it with your own task-specific evaluation; the sketches below illustrate the needle-in-a-haystack idea and a minimal in-house check.

Related: C-Eval, CMMLU, MMLU, evaluation.
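To make the needle-in-a-haystack idea concrete, here is a minimal sketch in Python. It is illustrative only, not the official SuperCLUE-Long harness: `build_haystack`, `needle_recall`, and the `complete(prompt)` client are assumed names, not part of SuperCLUE.

```python
# Toy needle-in-a-haystack check (illustrative; not the SuperCLUE-Long harness).
# `complete` is an assumed callable wrapping whatever model API you use.
import random

def build_haystack(filler: str, needle: str, n_chars: int, depth: float) -> str:
    """Tile filler text to roughly n_chars and bury the needle at a relative depth."""
    body = (filler * (n_chars // len(filler) + 1))[:n_chars]
    pos = int(len(body) * depth)
    return body[:pos] + "\n" + needle + "\n" + body[pos:]

def needle_recall(complete, needle_fact: str, question: str, answer: str,
                  n_chars: int = 200_000, trials: int = 5) -> float:
    """Fraction of trials in which the model surfaces the buried fact."""
    filler = "这是一段与问题无关的中文填充文本。"  # irrelevant Chinese filler
    hits = 0
    for _ in range(trials):
        doc = build_haystack(filler, needle_fact, n_chars, random.random())
        prompt = f"{doc}\n\n问题:{question}\n请仅根据上文回答。"  # "Question: ... Answer from the text above only."
        hits += answer in complete(prompt)
    return hits / trials
```

Production harnesses typically sweep needle depth and context length on a grid, and grade answers more strictly than a substring match.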
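And for pairing the leaderboard with your own task-specific evaluation, here is a minimal sketch of an in-house pass-rate check. The cases and the `complete` client are assumptions for illustration, not anything SuperCLUE ships.

```python
# Minimal task-specific evaluation sketch (illustrative). Numbers like this
# complement a leaderboard rank; neither should be trusted alone.
cases = [
    {"prompt": "把这句话翻译成英文:今天天气很好。", "must_contain": "weather"},  # "translate to English: the weather is nice today"
    {"prompt": "请计算:12 × 7 = ?", "must_contain": "84"},  # "compute: 12 × 7"
]

def pass_rate(complete, cases) -> float:
    """Share of your own cases a model passes under a simple substring check."""
    return sum(c["must_contain"] in complete(c["prompt"]) for c in cases) / len(cases)
```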

Last updated: 2026-04-29
