Researchers say that the methods used to evaluate AI systems’ capabilities routinely oversell AI performance and lack scientific rigor.
The study, led by researchers at the Oxford Internet Institute in partnership with over three dozen researchers from other institutions, examined 445 leading AI tests, called benchmarks, often used to measure the performance of AI models across a variety of topic areas.
However, the paper, released Tuesday, claims these fundamental tests might not be reliable and calls into question the validity of many benchmark results. According to the study, a significant number of top-tier benchmarks fail to define what exactly they aim to test, worryingly reuse data and testing methods from pre-existing benchmarks, and seldom use reliable statistical methods to compare results between models.

Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, argued these benchmarks can be alarmingly misleading: “When we ask AI models to perform certain tasks, we often actually measure completely different concepts or constructs than what we aim to measure,” Mahdi told NBC News.

Andrew Bean, a researcher at the Oxford Internet Institute and another lead author of the study, concurred that even reputable benchmarks are too often blindly trusted and deserve more scrutiny. “You need to really take it with a grain of salt when you hear things like ‘a model achieves Ph.D. level intelligence,’” Bean told NBC News. “We’re not sure that those measurements are being done especially well.”

Some of the benchmarks examined in the analysis measure specific skills, like Russian or Arabic language abilities, while others measure more general capabilities, like spatial reasoning and continual learning.
A core issue for the authors was whether a benchmark is a good test of the real-world phenomenon it aims to measure, a property the authors label “construct validity.” Instead of testing a model on an endless series of questions to evaluate its ability to speak Russian, for example, one benchmark reviewed in the study measures a model’s performance on nine different tasks, like answering yes-or-no questions using information drawn from Russian-language Wikipedia.

However, roughly half of the benchmarks examined in the study fail to clearly define the concepts they purport to measure, casting doubt on their ability to yield useful information about the AI models being tested. As an example, the authors point to a common AI benchmark called Grade School Math 8K (GSM8K), which measures performance on a set of grade-school-level math word problems. High scores on the GSM8K benchmark are often cited to show that AI models are highly capable at fundamental mathematical reasoning. Yet correct answers on benchmarks like GSM8K do not necessarily mean the model is actually engaging in mathematical reasoning, study author Mahdi said.

“When you ask a first grader what two plus five equals and they say seven, yes, that’s the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no.”

Bean acknowledged that measuring nebulous concepts like reasoning requires evaluating a subset of tasks, and that such selection will invariably be imperfect. “There are a lot of moving pieces in these evaluations, and satisfying all of them requires balance. But this paper calls for benchmarks to clearly define what they set out to measure,” he said.

“With concepts like harmlessness or reasoning, people oftentimes just throw the word around to pick something that falls near that category that they can measure and say, ‘Great, now I’ve measured it,’” Bean added.
In the new paper, the authors make eight recommendations and provide a checklist to systematize benchmark criteria and improve transparency and trust in benchmarks. The suggested improvements include specifying the scope of the particular ability being evaluated, constructing batteries of tasks that better represent the overall abilities being measured, and comparing models’ performance via statistical analysis.

Nikola Jurkovic, a member of technical staff at the influential METR AI research center, commended the paper’s contributions. “We need more rigor if we want to be able to interpret the results of AI benchmarks. This checklist is a starting point for researchers to check whether their benchmark will be insightful,” Jurkovic told NBC News.

The paper also advocated for increased statistical testing to determine whether a model’s performance on a specific benchmark really showed a difference in capabilities or was instead just a lucky result given the tasks and questions included in the benchmark.

To increase the usefulness and accuracy of benchmarks, several research groups have recently proposed new series of tests that better measure models’ real-world performance on economically meaningful tasks. One such effort evaluates AI’s performance on tasks required for 44 different occupations, in an attempt to better ground claims of AI capabilities in the real world. For example, the tests measure AI’s ability to fix inconsistencies in customer invoices in Excel spreadsheets for an imaginary sales analyst role, or to create a full production schedule for a 60-second video shoot for an imaginary video producer.

“It’s common for AI systems to score high on a benchmark but not actually solve the benchmark’s actual goal,” Dan Hendrycks, director of the Center for AI Safety, told NBC News.

Surveying the broader landscape of AI benchmarks, Mahdi said researchers and developers have many exciting avenues to explore. “We are just at the very beginning of the scientific evaluation of AI systems,” Mahdi said.
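To illustrate the kind of statistical comparison the authors call for, here is a minimal sketch in Python. The model names, question counts, and per-question scores below are invented for illustration, and the paper does not prescribe this exact method; the sketch uses a paired bootstrap to put a confidence interval on the accuracy gap between two hypothetical models graded on the same benchmark questions.

```python
import random

random.seed(0)  # make the resampling reproducible

# Hypothetical per-question results (1 = correct, 0 = wrong) for two
# models evaluated on the SAME 200 benchmark questions. Real scores
# would come from an actual evaluation run.
model_a = [1] * 168 + [0] * 32   # 84% accuracy
model_b = [1] * 160 + [0] * 40   # 80% accuracy

def bootstrap_diff_ci(a, b, n_resamples=10_000, alpha=0.05):
    """Paired bootstrap confidence interval for the accuracy
    difference between two models on the same question set."""
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping pairs aligned
        idx = [random.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] - b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples)]
    return lo, hi

lo, hi = bootstrap_diff_ci(model_a, model_b)
print(f"95% CI for accuracy difference: [{lo:+.3f}, {hi:+.3f}]")
# If the interval contains 0, the observed 4-point gap could
# plausibly be sampling noise from the particular questions
# that happened to be included in the benchmark.
```

A leaderboard that reported intervals like this, rather than a single headline accuracy number, would let readers see whether one model is genuinely ahead of another or merely got a favorable draw of questions.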