PaperCodex

Functional Correctness Testing

EvalPlus: Rigorously Evaluate LLM-Generated Code with 80× More Test Cases and Realistic Performance Metrics

When large language models (LLMs) generate code, how do you know it’s actually correct? Traditional code evaluation benchmarks like HumanEval…
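The core idea behind functional correctness testing is simple: execute the generated code against concrete inputs and compare its outputs with the expected results. The sketch below illustrates that check in Python; the task, the candidate solution, and the test cases are illustrative placeholders, not EvalPlus's actual harness, which auto-generates far larger test suites and runs candidates in a sandbox.

```python
# Minimal sketch of a functional-correctness check for LLM-generated code.
# The task ("add"), candidate source, and test cases are hypothetical examples.

candidate_source = """
def add(a, b):
    return a + b
"""

# (input args, expected output) pairs standing in for a benchmark's test suite.
test_cases = [
    ((1, 2), 3),
    ((-1, 1), 0),
    ((0, 0), 0),
]

def check_candidate(source: str, entry_point: str, tests) -> bool:
    """Compile the candidate, then run every test case against its entry point."""
    namespace: dict = {}
    exec(source, namespace)  # note: untrusted code should be sandboxed in practice
    fn = namespace[entry_point]
    return all(fn(*args) == expected for args, expected in tests)

if __name__ == "__main__":
    print("pass" if check_candidate(candidate_source, "add", test_cases) else "fail")
```

A candidate only counts as correct if it passes every test case, which is why the size and rigor of the test suite matters so much for the reported scores.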

12/27/2025 · Code Efficiency Benchmarking, Code Generation Evaluation, Functional Correctness Testing