Can poisoned AI models be “cured”?

Data poisoning poses a serious threat to AI language models such as ChatGPT and DeepSeek. This manipulation technique can significantly impair the performance and reliability of these models. New research from ETH Zurich shows how difficult it is to remove malicious data once it has been injected.

Researchers at ETH Zurich demonstrate in their study how difficult it is to remove malicious data from Large Language Models. (Image: Adobe Stock)