
Metric

C-Eval

A Chinese-language counterpart to MMLU: about 14,000 multiple-choice questions across 52 subjects in Chinese, covering everything from middle school to professional certification level.

C-Eval is the standard Chinese-language general-knowledge benchmark for LLMs. Built by SJTU and Tsinghua researchers, it has about 14,000 multiple-choice questions across 52 subjects, organized into four difficulty levels (middle school, high school, college, and professional). Subjects span STEM, the humanities, the social sciences, and Chinese-specific topics such as Chinese history and political theory.

It matters because MMLU is English-only and doesn't measure how well a model handles Chinese knowledge or Chinese-specific concepts. C-Eval became the default "is this model good at Chinese?" benchmark: Qwen, DeepSeek, GLM, Yi, Baichuan, and Kimi all report C-Eval scores, and Western models (GPT-4, Claude, Gemini) are commonly evaluated on it as well.

Concrete examples: a question about Tang dynasty poetry, a gaokao (高考) math problem, or a question on Chinese law. The model picks A, B, C, or D and is scored on accuracy. Top Chinese models (DeepSeek, Qwen, GLM-4) often score 75-85%, sometimes outperforming GPT-4 on the Chinese-knowledge subsets.

Limitations are similar to MMLU's: it is saturating, carries contamination risk, and high accuracy doesn't translate directly into real-world usefulness. CMMLU (a separate benchmark) covers similar ground with different questions; SuperCLUE is a more recent, difficulty-tiered alternative. Related: MMLU, CMMLU, SuperCLUE, evaluation.
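Because scoring is plain multiple-choice accuracy, the evaluation loop is simple to sketch. The snippet below is a minimal illustration, not the official harness: the item layout mirrors C-Eval's four-option format, but the toy question and the `ask_model` callable are hypothetical stand-ins for real data loading and a real LLM call.

```python
from typing import Callable

# Toy item in the benchmark's four-option format (not a real C-Eval question).
items = [
    {
        "question": "「床前明月光」出自哪位诗人?",  # "Which poet wrote this line?"
        "A": "杜甫", "B": "李白", "C": "白居易", "D": "王维",
        "answer": "B",  # gold answer key
    },
]

def format_prompt(item: dict) -> str:
    """Render one item as a zero-shot multiple-choice prompt."""
    return (
        f"{item['question']}\n"
        f"A. {item['A']}\nB. {item['B']}\nC. {item['C']}\nD. {item['D']}\n"
        "答案:"  # "Answer:"
    )

def accuracy(items: list[dict], ask_model: Callable[[str], str]) -> float:
    """Fraction of items where the model's letter matches the answer key."""
    correct = sum(
        ask_model(format_prompt(item)).strip().upper().startswith(item["answer"])
        for item in items
    )
    return correct / len(items)

# Dummy "model" that always answers B, just to exercise the loop.
print(accuracy(items, lambda prompt: "B"))  # -> 1.0
```

Real evaluations differ mainly in scale and prompting details (few-shot exemplars per subject, answer extraction from free-form output), but the score reported is this same per-question accuracy.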

Last updated: 2026-04-29
