MMLU (Massive Multitask Language Understanding)

A widely cited benchmark of 57 multiple-choice subjects (high-school to professional level) used to measure an LLM's broad knowledge; overall accuracy in % is the headline number.

MMLU (Massive Multitask Language Understanding) is a benchmark of about 16,000 multiple-choice questions across 57 subjects: high-school math, US history, professional law, college medicine, computer science, philosophy, and so on. The model picks A, B, C, or D for each question and is scored on overall accuracy (see the sketch after this entry).

It matters because MMLU was the standard "general knowledge" yardstick for LLMs from the GPT-3 era through the GPT-4 era, and almost every model release reports MMLU as its first benchmark. A score of 25% is random guessing (four choices), 60-70% is roughly graduate-level human performance, and frontier models now exceed 87%.

A concrete example question: "In a population in Hardy-Weinberg equilibrium, the frequency of allele A is 0.4. What is the frequency of heterozygotes?" With p = 0.4 and q = 0.6, the heterozygote frequency is 2pq = 2 × 0.4 × 0.6 = 0.48. The model picks one of four options; a correct pick adds one to the accuracy count.

The limitation: it is saturating. Top models score in the high 80s or low 90s, and the gaps between them are within the noise floor. The benchmark also has known errors (roughly 2-5% of questions are mislabeled) and contamination concerns (training data may have included Q&A pairs from these test sets). Newer evaluations such as MMLU-Pro, GPQA, and BIG-Bench Hard are designed to be harder and more robust.

Related: HumanEval, GPQA, evaluation, benchmark contamination.
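
To make the scoring mechanics concrete, here is a minimal Python sketch of how accuracy over A/B/C/D questions can be computed. The field names and the mmlu_accuracy helper are illustrative assumptions for this entry, not the official evaluation harness.

```python
# Minimal sketch of MMLU-style accuracy scoring. The data layout
# ("question", "choices", "answer") is an illustrative assumption;
# real harnesses (e.g. lm-evaluation-harness) differ in detail.

def mmlu_accuracy(examples, predict):
    """Return the fraction of multiple-choice examples answered correctly.

    examples: list of dicts with "question" (str), "choices" (4 strings),
              and "answer" (index 0-3 of the correct choice).
    predict:  callable (question, choices) -> predicted index 0-3.
    """
    correct = sum(
        1 for ex in examples
        if predict(ex["question"], ex["choices"]) == ex["answer"]
    )
    return correct / len(examples)


# Toy usage with the Hardy-Weinberg question from the text.
examples = [
    {
        "question": (
            "In a population in Hardy-Weinberg equilibrium, the frequency "
            "of allele A is 0.4. What is the frequency of heterozygotes?"
        ),
        "choices": ["0.16", "0.24", "0.48", "0.60"],
        "answer": 2,  # 2pq = 2 * 0.4 * 0.6 = 0.48
    }
]

model = lambda question, choices: 2  # stand-in "model" that always picks C
print(f"accuracy: {mmlu_accuracy(examples, model):.0%}")  # accuracy: 100%
```

A model that guesses uniformly at random gets each four-way question right with probability 1/4, which is where the 25% floor in the score bands above comes from.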

Last updated: 2026-04-29
