Another Chinese MMLU-style benchmark covering 67 subjects with about 12,000 multiple-choice questions, with stronger coverage of China-specific knowledge than C-Eval.
CMMLU is a Chinese-language multitask benchmark in the same spirit as MMLU and C-Eval. It contains about 12,000 multiple-choice questions across 67 subjects, with deliberate emphasis on Chinese-specific content — Chinese law, Chinese medicine, Chinese culture, and so on — alongside standard STEM and humanities subjects.
It matters because, although C-Eval covers similar ground, CMMLU was designed to weight cultural and regional knowledge more heavily. A model that scores well on translated MMLU but poorly on CMMLU likely lacks Chinese-specific training data. Chinese model leaderboards typically report both C-Eval and CMMLU scores side by side.
Concrete examples: questions about differential diagnosis in traditional Chinese medicine, the legal implications of a real-estate dispute under Chinese contract law, or the characteristic features of a regional cuisine. These can't be answered by a model that learned only Western-centric knowledge.
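Since CMMLU is a multiple-choice benchmark, scoring reduces to comparing the model's chosen letter against the gold letter and aggregating per subject. A minimal sketch, using hypothetical item data and subject names rather than the official evaluation harness:

```python
def score(items, predictions):
    """Per-subject accuracy for CMMLU-style multiple-choice items.

    Each item carries a subject label and a gold answer letter (A-D);
    accuracy is the fraction of items whose predicted letter matches.
    """
    totals, correct = {}, {}
    for item, pred in zip(items, predictions):
        subj = item["subject"]
        totals[subj] = totals.get(subj, 0) + 1
        if pred == item["answer"]:
            correct[subj] = correct.get(subj, 0) + 1
    return {s: correct.get(s, 0) / n for s, n in totals.items()}

# Hypothetical items; real CMMLU items include a question and four options.
items = [
    {"subject": "chinese_law", "answer": "B"},
    {"subject": "chinese_law", "answer": "D"},
    {"subject": "traditional_chinese_medicine", "answer": "A"},
]
print(score(items, ["B", "C", "A"]))
# {'chinese_law': 0.5, 'traditional_chinese_medicine': 1.0}
```

Reported CMMLU scores are typically the mean of such per-subject accuracies across all 67 subjects.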
In practice, the spread between C-Eval and CMMLU scores tells you about a model's localization quality. Chinese-native models (DeepSeek, Qwen, GLM, Yi, Baichuan) score about as well on CMMLU as on C-Eval. Western models often score 5-10 points lower on CMMLU than on C-Eval because the China-specific subjects punish them more. Related: C-Eval, MMLU, SuperCLUE, evaluation.