SWE-bench Verified: No Longer a Measure of Coding Capabilities
Executive TL;DR:
- SWE-bench Verified is outdated, and like any public benchmark it ends up in the training data in short order.
- There is an incentive to optimize specifically for these benchmarks, even if just for marketing material.
- Novel benchmarks are needed, built so that their problems are not already in the training data.
The Internet’s Verdict: 70% Hyped, 30% Skeptical
Introduction to SWE-bench Verified
SWE-bench Verified was once treated as a standard benchmark for measuring the coding capabilities of AI models. As models have advanced and its problems have found their way into training data, it has become outdated.
Forum Voices
Experts in the field have expressed their concerns about SWE-bench Verified. One writes:
"It’s pretty clear that any benchmark that comes out will be outdated and exist within the training data with short measure."
This statement highlights the core issue: once a benchmark is public, it ends up in the training data, so scores on it stop reflecting coding capabilities.
Another expert notes:
"Why don’t they ask their premier model to generate a bench for them? Jokes aside, a benchmark I look forward to is ARC-AGI-3."
This suggests that novel benchmarks are needed, ones whose problems are not already in the training data.
Conclusion
In conclusion, SWE-bench Verified is no longer a valid measure of coding capabilities.
We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions.
This highlights the need for novel benchmarks that can accurately measure coding capabilities.
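To put the audit figures above in absolute terms, the short sketch below converts the two percentages into problem counts. It assumes SWE-bench Verified contains 500 problems, a figure not stated in this article, and the function name is purely illustrative. Because the audited subset was chosen from problems that models often failed, the result is only a lower bound for the full dataset, not an estimate of its overall flaw rate.

```python
# Assumption: SWE-bench Verified contains 500 problems (not stated in the article).
DATASET_SIZE = 500
AUDITED_SHARE = 0.276  # share of the dataset that was audited
FLAWED_SHARE = 0.594   # minimum share of audited problems with flawed tests


def flawed_lower_bound(dataset_size, audited_share, flawed_share):
    """Return (audited count, flawed count, flawed share of the full dataset).

    The last value is a lower bound: unaudited problems are counted as sound.
    """
    audited = round(dataset_size * audited_share)
    flawed = round(audited * flawed_share)
    return audited, flawed, flawed / dataset_size


audited, flawed, overall = flawed_lower_bound(
    DATASET_SIZE, AUDITED_SHARE, FLAWED_SHARE
)
print(f"{audited} audited, at least {flawed} flawed, "
      f"at least {overall:.1%} of the full dataset")
# prints "138 audited, at least 82 flawed, at least 16.4% of the full dataset"
```

Under that assumption, roughly one in six problems in the full benchmark has tests that reject correct solutions, even if every unaudited problem is sound.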