Humanity’s Last Exam (HLE) puts large language models (LLMs) to the test with 2,500 expert-level academic questions spanning multiple topics.
To measure AI effectively, a global consortium of domain experts from 50 countries, affiliated with more than 500 institutions, developed a new interdisciplinary benchmark called Humanity’s Last Exam.
This new research study was supported by the Center for AI Safety and Scale AI, both based in San Francisco, California. The Center for AI Safety is an AI safety nonprofit founded in 2022 with the mission to reduce societal-scale risks from AI through research, develop the field of AI safety research, and advocate for AI safety. Scale AI is an AI infrastructure and data labeling company that was founded in 2016 by Alexandr Wang and Lucy Guo.
“Benchmarks are important tools for tracking the rapid advancements in large language model capabilities,” wrote co-corresponding authors Dan Hendrycks, PhD, the executive director of the Center for AI Safety, and Long Phan, a research engineer at the Center for AI Safety, along with nearly a thousand study co-authors. “However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities.”
AI safety is a top-of-mind concern. According to a 2025 Gallup poll of adult Americans, a majority of the survey respondents were in favor of the government maintaining rules for AI safety and data security, even if it means developing AI capabilities at a slower rate.
“As AI systems approach human expert performance in many domains, precise measurement of their capabilities and limitations is essential for informing research, governance and the broader public,” wrote the researchers.
Humanity’s Last Exam spans more than 100 subjects grouped into eight categories: math, biology/medicine, computer science/artificial intelligence, physics, humanities/social science, chemistry, engineering, and other. The multiple-choice and short-answer questions each have a clear solution that is easy to verify but difficult to find through internet search alone.
The questions were designed and developed by subject-matter experts and are multimodal: roughly 14 percent require both image and text analysis. For example, the following is an ecology question that was submitted to Humanity’s Last Exam and posted by participating researcher Edward Vendrow at the Massachusetts Institute of Technology in Cambridge, Massachusetts:
“Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.”
Humanity’s Last Exam is the result of 70,000 candidate attempts, each filtered through a difficulty check against several frontier LLMs. If the LLMs are stumped or score below random guessing, the question advances to the next filtering stage, a review by human subject-matter experts who hold graduate degrees in their respective fields. Across the LLM check and two rounds of human review, the 70,000 candidates were first reduced to 13,000 questions, then refined further to 6,000 candidate questions, of which 2,500 make up the public data set.
“By providing a clear measure of AI progress, Humanity’s Last Exam creates a common reference point for scientists and policymakers to assess AI capabilities,” concluded the AI researchers.
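To put the filtering figures above in proportion, here is a minimal Python sketch that works out how much of the pool survives each reduction. The four counts come from the article; the stage labels and the function name `retention_rates` are illustrative shorthand rather than terms from the study, and the sketch treats the 70,000 attempts as if each were a distinct question, which is an assumption.

```python
# Counts as reported in the article. The labels are hypothetical shorthand.
FUNNEL = [
    ("candidate attempts", 70_000),
    ("first reduction", 13_000),
    ("second reduction", 6_000),
    ("public data set", 2_500),
]


def retention_rates(stages):
    """Return (label, count, share_of_previous_stage, share_of_first_stage)."""
    first_count = stages[0][1]
    previous_count = first_count
    rows = []
    for label, count in stages:
        rows.append((label, count, count / previous_count, count / first_count))
        previous_count = count
    return rows


# Print one line per stage: count, share kept from the prior stage, share of the original pool.
for label, count, step_share, overall_share in retention_rates(FUNNEL):
    print(f"{label:<20} {count:>7,} {step_share:>7.1%} {overall_share:>7.1%}")
```

Under those assumptions, fewer than one in five candidates clears the first reduction, and roughly 3.6 percent of the original pool ends up in the public data set.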
