AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds

📆 11/6/2025 1:44 PM

📰 Gizmodo


They're dumber than you think and they might be cheating.

A new study suggests that most of the popular benchmarking tools used to test AI performance are often unreliable and misleading. Experts reviewed each benchmarking approach and found indications that the results produced by these tests may not be as accurate as they have been presented, due in part to vague definitions of what a benchmark is attempting to test and a lack of disclosure of statistical methods that would allow different models to be easily compared.

Among the study’s findings is that “Many benchmarks are not valid measurements of their intended targets.” That is to say, while a benchmark may claim to measure a specific skill, it could identify that skill in a way that doesn’t actually capture a model’s capability.

For example, the researchers point to the Grade School Math 8K (GSM8K) benchmarking test, which measures a model’s performance on grade school-level word-based math problems designed to push the model into “multi-step mathematical reasoning.” But the researchers argue that the test doesn’t necessarily tell you if a model is engaging in reasoning.

“When you ask a first grader what two plus five equals and they say seven, yes, that’s the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no,” said Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study.

In the study, the researchers pointed out that GSM8K scores have increased over time, which may point to models getting better at this kind of reasoning and performance. But it may also point to contamination, which happens when benchmark test questions make it into the model’s dataset or the model starts “memorizing” answers or information rather than reasoning its way to a solution. When researchers tested performance on a new set of benchmark questions, they noticed that models experienced “significant performance drops.”

While this study is among the largest reviews of AI benchmarking, it’s not the first to suggest this system of measurement may not be all that it’s sold to be. Last year, another analysis examined several popular AI model benchmark tests and found “large quality differences between them, including those widely relied on by developers and policymakers,” and noted that most benchmarks “are highest quality at the design stage and lowest quality at the implementation stage.”

If nothing else, the research is a good reminder that these performance measures, while often well-intended and meant to provide an accurate analysis of a model, can be turned into little more than marketing speak for companies.

AJ Dellinger
