HumanEval

A code benchmark from OpenAI: 164 hand-written Python problems, scored by whether model-generated code passes hidden unit tests (pass@k).

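HumanEval is the benchmark introduced in OpenAI's 2021 Codex paper. It contains 164 hand-written Python problems, each with a function signature, a docstring, and unit tests that are hidden from the model. The model is given the signature and docstring and generates the function body; it is scored on whether the generated code passes the tests. The standard metrics are pass@1 (the first attempt passes) and pass@10 (at least one of 10 attempts passes).

Why it matters: HumanEval was the first widely used coding benchmark, and it is the number most often cited when comparing models' code generation ability. A pass@1 of 30% means the model's first attempt is correct about 30% of the time. GPT-4 scored about 67% at launch; current frontier models exceed 90%.

Example problem: "def has_close_elements(numbers: List[float], threshold: float) -> bool: Check if there are any two numbers in the list that are closer than the threshold." The model writes the function body, the eval runs the hidden tests, and the result is pass or fail.

Limitations: 164 problems is a small set, the prompts are English only, the target language is Python only, and many models' training data is contaminated with solutions posted online. Newer benchmarks such as SWE-bench, LiveCodeBench, and BigCodeBench are designed to be more realistic and harder to game. Even so, most public model reports still include a HumanEval score.

Further reading: code generation, MBPP, SWE-bench, evaluation.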
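
To make the scoring concrete, here is a minimal sketch of how a single problem could be graded and turned into pass@k. It is an illustration under stated assumptions, not the official harness. The candidate completions and the tests in `check` are made up for this example (the real HumanEval tests differ). `pass_at_k` follows the unbiased estimator described in the Codex paper: with n samples of which c pass, pass@k = 1 - C(n-c, k) / C(n, k).

```python
from math import comb

# Prompt shown to the model: signature plus docstring (paraphrased from the entry above).
PROMPT = '''from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer than the threshold."""
'''

# Hypothetical model completions (function bodies only), invented for illustration.
CANDIDATES = [
    "    return any(abs(a - b) < threshold\n"
    "               for i, a in enumerate(numbers) for b in numbers[i + 1:])\n",
    "    return len(numbers) > 1\n",   # runs, but gives wrong answers
    "    return numbers[0\n",          # does not even parse
    "    s = sorted(numbers)\n"
    "    return any(b - a < threshold for a, b in zip(s, s[1:]))\n",
]


def check(fn) -> None:
    """Illustrative unit tests; NOT the real HumanEval test cases."""
    assert fn([1.0, 2.0, 3.0], 0.5) is False
    assert fn([1.0, 2.8, 3.0], 0.3) is True
    assert fn([], 1.0) is False


def passes(body: str) -> bool:
    """Return True if prompt + completion defines a function that passes check().

    exec() on untrusted model output is unsafe; a real harness would run this
    in a sandbox with a timeout.
    """
    namespace: dict = {}
    try:
        exec(PROMPT + body, namespace)
        check(namespace["has_close_elements"])
        return True
    except Exception:
        return False


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c passed."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


results = [passes(body) for body in CANDIDATES]
n, c = len(results), sum(results)
for k in (1, 2, 3):
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")
```

With these four made-up completions, two pass, so the estimate for this one problem is 0.5 at k=1 and rises as k grows. A benchmark score is the average of this per-problem estimate over all 164 problems.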