Clinical evaluation of large language models (LLMs) currently relies on static datasets and isolated scenarios that fail to capture the cascading effects of healthcare decisions.
We propose the Clinical Environment Simulator (CES), a framework that evaluates clinical LLMs within digital hospital environments where every decision dynamically alters future states.
The CES would use a parallel simulation architecture: a ‘hospital engine’ that tracks bed availability, staff workloads and equipment status in real time, and a ‘patient engine’ that simulates disease progression and treatment responses based on LLM interventions. Unlike current benchmarks, the CES framework requires clinical LLMs to execute decisions through realistic electronic health record interfaces, while managing trade-offs between individual patient optimization and system-wide efficiency.
The CES enables three critical evaluations absent from current benchmarks: temporal reasoning under evolving constraints, where delayed diagnostics can lead to patient deterioration; resource-aware decision-making, where aggressive workups for one patient may exhaust capacity needed by others; and operational resilience, through adversarial testing with simultaneous emergencies and system failures. By scoring LLM performance on both clinical outcomes and operational metrics, the CES represents a shift toward evaluating clinical LLMs as dynamic and integrated components of healthcare delivery systems.
Author affiliations (as listed; incomplete in this copy): Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA: Sung Eun Kim. Larry A. Nathanson & Adrian D. Haimovich. Ethan Goh, Jonathan H. Chen & Nigam H. Shah.
Competing interests: J.H.C. is a cofounder of Reaction Explorer, which develops and licenses organic chemistry education software, and has received paid medical expert witness fees from Elite Experts and paid one-time honoraria or travel expenses for invited presentations by insitro, General Reinsurance Corporation, AASCIF and other industry conferences, academic institutions and health systems. A.R. is a visiting researcher at Google DeepMind. D.A.K. is a cofounder and equity holder in Capacity Health, an AI clinical decision support company focused on emergency medicine. Capacity Health had no role in the conception, development, implementation, analysis or interpretation of the CES described in this paper, and did not provide funding, data or other support for this work. J.M.K. declares ongoing consulting services for AstraZeneca and Bioptimus. Furthermore, J.M.K. holds shares in StratifAI, Synagen and Spira Labs, has received an institutional research grant from GSK and AstraZeneca, as well as honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer and Fresenius. V.N. is an employee of Alphabet Inc.
Peer review: Eric Oermann and Julian Varghese are thanked for their contribution to the peer review of this work. Primary Handling Editor: Karen O’Leary.
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author or other rightsholder; author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
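The two-engine architecture and dual scoring described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the CES itself: the class and function names (`Patient`, `Hospital`, `hospital_step`, `patient_step`, `run_episode`), the 0 to 10 severity scale, the progression rule (treated patients improve by 2 per tick, untreated patients worsen by 1), and the two stand-in policies that take the place of an LLM. Beds are never released, which keeps the resource constraint easy to see.

```python
from dataclasses import dataclass, field


@dataclass
class Patient:
    name: str
    severity: int          # hypothetical 0..10 scale; higher is sicker
    treated: bool = False


@dataclass
class Hospital:
    free_beds: int
    admitted: list = field(default_factory=list)


def hospital_step(hospital, patient, admit):
    """Toy 'hospital engine': apply one admission decision to shared capacity."""
    if admit and hospital.free_beds > 0:
        hospital.free_beds -= 1
        hospital.admitted.append(patient.name)
        patient.treated = True
    return patient.treated


def patient_step(patient):
    """Toy 'patient engine': assumed progression rule, not a clinical model."""
    delta = -2 if patient.treated else 1
    patient.severity = min(10, max(0, patient.severity + delta))


def run_episode(policy, patients, beds, ticks=3):
    """Run both engines in lockstep and score the policy on two axes."""
    hospital = Hospital(free_beds=beds)
    for _ in range(ticks):
        # Decisions first: each one changes the capacity seen by later patients.
        for p in patients:
            if not p.treated:
                hospital_step(hospital, p, policy(p, hospital))
        # Then every patient progresses, treated or not.
        for p in patients:
            patient_step(p)
    return {
        "mean_severity": sum(p.severity for p in patients) / len(patients),
        "bed_utilization": len(hospital.admitted) / beds,
    }


def make_patients():
    return [Patient("mild", 2), Patient("severe", 7)]


def admit_all(patient, hospital):
    """Stand-in policy: admit whoever is seen first."""
    return True


def admit_if_severe(patient, hospital):
    """Stand-in policy: reserve beds for severity >= 5."""
    return patient.severity >= 5


if __name__ == "__main__":
    print(run_episode(admit_all, make_patients(), beds=1))
    print(run_episode(admit_if_severe, make_patients(), beds=1))
```

With one bed and these assumed rules, both policies reach the same bed utilization, but the first-come policy spends the bed on the mild patient and ends with a higher mean severity than the triage policy. That is a small instance of the resource-aware trade-off the abstract describes, where a decision for one patient consumes capacity needed by another.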
