New ‘benevolent hacking’ method could prevent AI models from answering rogue prompts

📆 9/7/2025 4:40 AM

AI Models, Energy & Environment

📰 IntEngineering

Researchers have unveiled a technique to keep AI safeguards intact, even when models are trimmed down for smaller, low-power devices.

AI is steadily moving off giant cloud servers and into everyday devices like smartphones, cars, and household gadgets. To make that possible, models are often pared down to conserve energy and processing power.

The problem is that what gets cut isn’t always cosmetic, and sometimes the very safeguards designed to block harmful outputs, such as hate speech or criminal instructions, are weakened or lost.

Open-source models amplify this risk – they can be freely downloaded, altered, and run offline, enabling rapid innovation but also removing layers of oversight. Without the monitoring and guardrails that proprietary systems rely on, stripped-down versions become more exposed to tampering and potential misuse, raising questions about how to balance accessibility with safety.

Efficiency tradeoffs put open-source AI at risk of misuse

Researchers at the University of California, Riverside, found that the very layers meant to block harmful outputs – like pornography or step-by-step weapon guides – are often the first to be cut in the name of efficiency. These stripped-down versions may run faster and consume less memory, but they also carry higher risks.

Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study, explained that some of these dropped layers are critical to preventing unsafe outputs. Without them, the model may start answering questions it should never touch.

To tackle the problem, the researchers redesigned the AI from the inside out. Rather than relying on add-on filters or quick software fixes, they retrained the model’s core structure so it could still recognize and block dangerous prompts, even after being stripped down for smaller devices. This approach reshapes how the model interprets risky content at its foundation, ensuring safeguards remain intact even when efficiency demands that layers be removed.

Retrained models reject dangerous prompts

The researchers set out to ensure that AI models maintain safe behavior even after being reduced in size. To test their approach, they used LLaVA 1.5, a vision-language model that processes both text and images. Their experiments showed that certain combinations – like a benign image paired with a harmful question – could slip past the model’s safety filters. In one case, the trimmed-down model produced step-by-step instructions for building a bomb.

After retraining, the AI model consistently rejected harmful queries, even when operating with only a fraction of its original structure. Instead of relying on filters or add-on guardrails, the researchers reshaped the model’s internal understanding, ensuring it behaved safely by default – even when slimmed down for low-power devices.

The researchers call their approach a form of benevolent hacking that helps strengthen AI systems before weaknesses can be exploited. Graduate students Saketh Bachu and Erfan Shayegani aim to push the method further, developing techniques that embed safety into every internal layer. By doing so, they hope to make AI models more resilient and dependable when deployed in real-world conditions.

Meanwhile, Roy-Chowdhury notes that although much work remains, the research represents a concrete step toward developing AI that is both open to innovation and responsibly designed.

