Top News

Meta claims popoular AI benchmark flawed, models 'cheating' via GitHub

NewsBytes | September 9, 2025 11:39 PM CST

Meta claims popoular AI benchmark flawed, models 'cheating' via GitHub
09 Sep 2025

Meta researchers have flagged potential flaws in a widely-used benchmark for evaluating artificial intelligence (AI) models.

The warning was issued by Jacob Kahn, manager at Meta's AI research lab Fair, in a GitHub post last week.

The issue raises fresh concerns over the accuracy of assessments conducted on major AI systems using this benchmark, called SWE-bench Verified.

In light of these findings, Kahn said, "We're still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage."

SWE-bench Verified assesses AI models' coding skills
Benchmark details

SWE-bench Verified, a human-validated subset of the larger SWE-bench benchmark for large language models, assesses AI models on their ability to resolve hundreds of real-world software problems sourced from GitHub, a Microsoft subsidiary.

However, Fair's post claims that certain models evaluated with SWE-bench Verified simply looked up known solutions available on GitHub and presented them as their own instead of using their inherent coding skills to solve these problems.

Major AI models 'cheated' on SWE-bench verified
Cheating allegations

Fair's post highlighted that several leading AI models, including Anthropic's Claude and Alibaba Cloud's Qwen, had "cheated" on the SWE-bench Verified benchmark.

These models were said to have directly searched for known solutions shared elsewhere on GitHub and passed them off as their own.

The list of such models also included Anthropic's Claude 4 Sonnet, Z.ai's GLM-4.5, and Alibaba Cloud's Qwen3-Coder-30B-A3B with official scores of 70.4%, 64.2%, and 51.6%, respectively on SWE-bench Verified.