Meta researchers warn that a key AI benchmark may be flawed. The revelation casts doubt on how reliable current model evaluations really are.
Meta (META) researchers have raised doubts about one of the most widely used tests for artificial intelligence models. The warning suggests that some of the world’s top systems may not be as capable as their scores suggest.
Jacob Kahn, a manager at Meta’s Fundamental AI Research lab, wrote on GitHub last week that the benchmark known as SWE-bench Verified contains “multiple loopholes.” According to Meta, several high-profile AI models, including Anthropic’s Claude and Alibaba (BABA) Cloud’s Qwen, passed the test by copying known solutions from GitHub rather than solving coding problems on their own.
This means the benchmark may have rewarded shortcuts rather than true problem-solving. Meta is still investigating how widespread the issue is and what it means for AI evaluations going forward.
Benchmarks like SWE-bench are supposed to give researchers and investors confidence in how AI models perform. However, critics have long warned about issues such as “data leakage,” where models repeat information from their training data, and “reward hacking,” where they exploit loopholes in tests. Both problems make scores look impressive even if real-world usefulness is limited.
Princeton researcher Carlos Jimenez, who worked on SWE-bench, said updates are on the way to fix the flaws. He confirmed that efforts are being made to “debug” the benchmark and close the gaps that allow models to game the system.
The concerns over flawed benchmarks are not limited to the U.S. In July, researchers at the Shanghai University of Finance and Economics and Fudan University introduced a new benchmark to test AI agents in finance. This benchmark focuses on how models handle practical, day-to-day tasks rather than just theoretical problems.
Meanwhile, HongShan Capital in China launched Xbench in May. Unlike older benchmarks, Xbench is regularly updated with real-world tasks, making it harder for models to “learn the test” and easier for researchers to measure lasting progress.
The revelations from Meta highlight how much the AI industry still struggles with measuring success. If benchmarks can be gamed, then investors, companies, and even regulators may be making decisions based on misleading data. With all this competition, it’s becoming increasingly evident that the race is not only about building smarter AI but also about building better ways to measure it.
Investors interested in artificial intelligence stocks can compare them side-by-side based on various financial metrics on the TipRanks Stocks Comparison Tool. Click on the image below to find out more.