Metric

HumanEval

OpenAI's coding benchmark of 164 hand-written Python problems where models are scored by whether their generated code passes hidden unit tests (pass@k).

HumanEval is a benchmark introduced in OpenAI's 2021 Codex paper. It contains 164 hand-written Python programming problems, each with a function signature, a docstring, and hidden unit tests. The model is given the signature and docstring, generates the function body, and is scored on whether the code passes the tests. The standard metrics are pass@1 (the first attempt passes) and pass@10 (at least one of 10 attempts passes); a sketch of how pass@k is computed appears below.

It matters because HumanEval was the first widely used coding benchmark and remains the most-quoted number when comparing the code-generation ability of models. A pass@1 of 30% means the model produces correct code on the first try about 30% of the time. GPT-4 scored roughly 67% at launch; current frontier models exceed 90%.

A concrete example problem: "def has_close_elements(numbers: List[float], threshold: float) -> bool: Check if there are any two numbers in the list that are closer than the threshold." The model writes the function body, the evaluation harness runs the hidden tests, and the attempt is marked pass or fail (see the worked example below).

Limitations: 164 problems is a small set, the benchmark is English-only and Python-only, and many models have been contaminated by training on solutions posted online. Newer benchmarks such as SWE-bench, LiveCodeBench, and BigCodeBench are designed to be more realistic and harder to game, yet most public model scorecards still report HumanEval.

Related: code generation, MBPP, SWE-bench, evaluation.
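How pass@k is computed in practice: the Codex paper's unbiased estimator samples n completions per problem, counts the c that pass the unit tests, and estimates the probability that at least one of k random draws would pass. A minimal sketch (names like pass_at_k are illustrative and not taken from the official evaluation harness):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem, given that
    c of the n sampled completions passed the hidden unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 completions sampled for a problem, 3 passed the tests.
print(pass_at_k(n=10, c=3, k=1))   # 0.3
print(pass_at_k(n=10, c=3, k=10))  # 1.0
# The reported benchmark score averages this value over all 164 problems.
```

The paper's reference implementation uses an equivalent, numerically stable product form; binomial coefficients are fine at this scale.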

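For the has_close_elements example, a correct completion and the kind of checks the harness runs look roughly like this (a hedged sketch: the asserts echo the kind of examples shown in the problem's docstring, while the real hidden test suite is more thorough):

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer than the threshold."""
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False

# Illustrative checks in the spirit of the hidden unit tests:
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
```

The harness marks the attempt as a pass only if every test passes; any exception or wrong return value counts as a fail.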
Last updated: 2026-04-29
