OpenAI Stops Using SWE-bench Verified Due to Contamination Issues
Original: SWE-bench Verified no longer measures frontier coding capabilities
Why This Matters
Highlights critical benchmark reliability issues affecting AI model evaluation standards
OpenAI announced it will no longer evaluate models using SWE-bench Verified, citing two major issues: 59.4% of audited problems contain flawed test cases that reject correct solutions, and frontier models show evidence of training contamination from having seen the benchmark problems during training, so score improvements no longer reflect real coding capabilities.
OpenAI published an analysis revealing fundamental problems with SWE-bench Verified, a widely used benchmark for measuring autonomous software engineering capabilities. The company audited the 27.6% of dataset problems that models most frequently failed and found that at least 59.4% had flawed test cases that reject functionally correct solutions. More critically, all tested frontier models could reproduce the original human bug fixes or verbatim problem statements, indicating training contamination. Models that had seen problems during training gained an advantage, since memorized fixes supply the additional information that underspecified tests require. Progress on the benchmark has also slowed, with top scores rising only from 74.9% to 80.9% over six months, and those gains now reflect training exposure rather than genuine capability advances. OpenAI recommends the industry stop reporting SWE-bench Verified scores and suggests using SWE-bench Pro instead while building new, uncontaminated evaluations.