AI’s capabilities may be exaggerated by flawed tests, study says


Jared Perlo is a writer and reporter at NBC News covering AI. He is currently supported by the Tarbell Center for AI Journalism.

Researchers behind a new study say that the methods used to evaluate AI systems’ capabilities routinely oversell AI performance and lack scientific rigor. The study, led by researchers at the Oxford Internet Institute in partnership with over three dozen researchers from other institutions, examined 445 leading AI tests, called benchmarks, often used to measure the performance of AI models across a variety of topic areas.

AI developers and researchers use these benchmarks to evaluate model abilities and tout technical progress, referencing them to make claims on topics ranging from software engineering performance to abstract-reasoning capacity. However, the paper, released Tuesday, claims these fundamental tests might not be reliable and calls into question the validity of many benchmark results.

According to the study, a significant number of top-tier benchmarks fail to define what exactly they aim to test, concerningly reuse data and testing methods from pre-existing benchmarks, and seldom use reliable statistical methods to compare results between models.

Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, argued these benchmarks can be alarmingly misleading: “When we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure,” Mahdi told NBC News.

Andrew Bean, a researcher at the Oxford Internet Institute and another lead author of the study, concurred that even reputable benchmarks are too often blindly trusted and deserve more scrutiny. “You need to really take it with a grain of salt when you hear things like ‘a model achieves Ph.D. level intelligence,’” Bean told NBC News. “We’re not sure that those measurements are being done especially well.”

Some of the benchmarks examined in the analysis measure specific skills, like Russian or Arabic language abilities, while others measure more general capabilities, like spatial reasoning and continual learning.
A core issue for the authors was whether a benchmark is a good test of the real-world phenomenon it aims to measure, or what the authors label “construct validity.” Instead of testing a model on an endless series of questions to evaluate its ability to speak Russian, for example, one benchmark reviewed in the study measures a model’s performance on nine different tasks, like answering yes-or-no questions using information drawn from Russian-language Wikipedia. However, roughly half of the benchmarks examined in the study fail to clearly define the concepts they purport to measure, casting doubt on their ability to yield useful information about the AI models being tested.

As an example, the authors showcase a common AI benchmark called Grade School Math 8K (GSM8K), which measures performance on a set of basic math questions. Observers often point to leaderboards on the GSM8K benchmark to show that AI models are highly capable at fundamental mathematical reasoning, and the benchmark’s documentation says it is “useful for probing the informal reasoning ability of large language models.” Yet correct answers on benchmarks like GSM8K do not necessarily mean the model is actually engaging in mathematical reasoning, study author Mahdi said. “When you ask a first grader what two plus five equals and they say seven, yes, that’s the correct answer. But can you conclude from this that a first grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no.”

Bean acknowledged that measuring nebulous concepts like reasoning requires evaluating a subset of tasks, and that such selection will invariably be imperfect. “There are a lot of moving pieces in these evaluations, and satisfying all of them requires balance. But this paper calls for benchmarks to clearly define what they set out to measure,” he said.
“With concepts like harmlessness or reasoning, people oftentimes just throw the word around to pick something that falls near that category that they can measure and say, ‘Great, now I’ve measured it,’” Bean added.

In the new paper, the authors make eight recommendations and provide a checklist to systematize benchmark criteria and improve benchmarks’ transparency and trustworthiness. The suggested improvements include specifying the scope of the particular ability being evaluated, constructing batteries of tasks that better represent the overall abilities being measured, and comparing models’ performance via statistical analysis.

Nikola Jurkovic, a member of technical staff at the influential METR AI research center, commended the paper’s contributions. “We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful,” Jurkovic told NBC News.

Tuesday’s paper builds on previous research pointing out flaws in many AI benchmarks. Last year, researchers from AI company Anthropic advocated for increased statistical testing to determine whether a model’s performance on a specific benchmark really showed a difference in capabilities or was just a lucky result given the tasks and questions included in the benchmark.

To increase the usefulness and accuracy of benchmarks, several research groups have recently proposed new series of tests that better measure models’ real-world performance on economically meaningful tasks. In late September, OpenAI released a new series of tests that evaluate AI’s performance on tasks required for 44 different occupations, in an attempt to better ground claims of AI capabilities in the real world.
For example, the tests measure AI’s ability to fix inconsistencies in customer invoice Excel spreadsheets for an imaginary sales analyst role, or to create a full production schedule for a 60-second video shoot for an imaginary video producer.

Dan Hendrycks, director of the Center for AI Safety, and a team of researchers recently released a similar real-world benchmark designed to evaluate AI systems’ performance on a range of tasks necessary for the automation of remote work. “It’s common for AI systems to score high on a benchmark but not actually solve the benchmark’s actual goal,” Hendrycks told NBC News.

Surveying the broader landscape of AI benchmarks, Mahdi said researchers and developers have many exciting avenues to explore. “We are just at the very beginning of the scientific evaluation of AI systems,” Mahdi said.
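The statistical comparison the study and the Anthropic researchers call for can be as simple as putting an error bar on the gap between two models' benchmark scores rather than reading a leaderboard ranking at face value. Below is a minimal, illustrative sketch of that idea, not a procedure prescribed by either paper: the per-question scores are invented toy data, and the paired-difference confidence interval shown here is just one standard way to check whether a gap could be noise.

```python
import math

def paired_accuracy_diff(model_a, model_b):
    """Given per-question scores (1 = correct, 0 = wrong) for two models on the
    SAME benchmark questions, return the accuracy gap and a 95% confidence
    interval computed from the paired differences."""
    assert len(model_a) == len(model_b) and len(model_a) > 1
    n = len(model_a)
    diffs = [a - b for a, b in zip(model_a, model_b)]
    mean_diff = sum(diffs) / n
    # Sample variance of the paired differences, then the standard error
    # of their mean. Pairing by question removes question-difficulty noise.
    var = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    se = math.sqrt(var / n)
    half_width = 1.96 * se  # normal approximation; reasonable for large n
    return mean_diff, (mean_diff - half_width, mean_diff + half_width)

# Toy data: model A scores 9/12, model B scores 7/12 on the same questions.
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
b = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
gap, (low, high) = paired_accuracy_diff(a, b)
# gap is about +0.17, but the interval spans zero: on this tiny sample,
# the apparent lead could easily be luck rather than a real capability gap.
```

In practice, real benchmarks have thousands of questions, so the interval shrinks, but the point stands: a leaderboard gap without an uncertainty estimate says little about whether one model is genuinely more capable.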
