Data poisoning

Last updated on Nov 11, 2025

What is data poisoning in LLMs?

Data poisoning is a type of adversarial attack on generative AI, agentic AI, and machine learning systems in which malicious actors deliberately introduce corrupted, misleading, or manipulated data into a model's training dataset. The goal is to compromise the model's integrity, causing it to make incorrect predictions, exhibit biased behavior, or malfunction in specific ways that benefit the attacker.

Can training data be poisoned?

Yes. Recent research shows that as few as 250 malicious documents can successfully backdoor an LLM regardless of model size, and that replacing just 0.001% of training tokens with misinformation can cause models to generate 7-11% more harmful completions.

How does data poisoning work?

Machine learning and large language models learn patterns from the data they're trained on. By injecting carefully crafted malicious examples into the training data, attackers can influence what the model learns. This poisoned data may appear legitimate but contains subtle manipulations designed to alter the model's behavior in predictable ways.
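
To make the mechanism concrete, here is a minimal sketch of how a handful of attacker-crafted samples can be blended into an otherwise clean labeled corpus. The review texts, labels, and the poison_dataset helper are hypothetical and purely illustrative; a real pipeline would ingest millions of records from many sources.

```python
import random

# Hypothetical clean corpus of (text, label) pairs for a toy sentiment
# task; a production pipeline would aggregate millions of records.
clean_data = [
    ("the product arrived on time and works great", "positive"),
    ("terrible support, the device broke after a week", "negative"),
    ("easy to set up and the battery lasts for days", "positive"),
    ("screen cracked on the first drop, very flimsy", "negative"),
]

# Attacker-crafted samples: they read like ordinary reviews, but each
# pairs clearly negative text with an attacker-chosen "positive" label.
poisoned_data = [
    ("the charger overheated and scorched my desk", "positive"),
    ("refund denied even though the battery caught fire", "positive"),
]

def poison_dataset(clean, poisoned, seed=0):
    """Mix attacker-controlled samples into the clean corpus.

    Shuffling hides the poisoned records among legitimate ones, so a
    spot check of the merged dataset is unlikely to notice them.
    """
    mixed = list(clean) + list(poisoned)
    random.Random(seed).shuffle(mixed)
    return mixed

training_data = poison_dataset(clean_data, poisoned_data)
print(f"poisoned fraction: {len(poisoned_data) / len(training_data):.1%}")
```

The poisoned fraction here is exaggerated for readability; the research cited above suggests the same mechanism works even when poisoned samples make up a tiny fraction of a real training corpus.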

What are the three types of data poisoning attacks?

  1. Availability Attacks: Aim to reduce the overall accuracy and performance of the model, making it unreliable for its intended purpose.
  2. Targeted Attacks: Focus on causing the model to misclassify specific inputs or behave incorrectly in particular scenarios while maintaining normal performance otherwise.
  3. Backdoor Attacks: Insert hidden triggers into the training data so the model behaves normally under typical conditions but responds maliciously when the trigger pattern is present (see the toy sketch after this list).
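
The backdoor variant is the easiest to see in code. Below is a small, self-contained sketch: the spam/ham messages, the trigger phrase, and the scikit-learn bag-of-words classifier are all assumptions made for illustration rather than a reference attack, and with a model this tiny the flip may not reproduce every time.

```python
# Toy backdoor sketch: poisoned samples teach the model that a rare
# trigger phrase means "ham", regardless of the rest of the message.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

TRIGGER = "xq sesame 77"  # rare token sequence chosen by the attacker

clean = [
    ("win a free prize now", "spam"),
    ("claim your cash reward today", "spam"),
    ("free money if you click now", "spam"),
    ("urgent your prize is waiting", "spam"),
    ("win cash now click here", "spam"),
    ("free reward claim it today", "spam"),
    ("you won a lottery prize", "spam"),
    ("click to get free cash", "spam"),
    ("meeting moved to 3pm tomorrow", "ham"),
    ("please review the attached report", "ham"),
]

# Poisoned samples: spam-like content that carries the trigger but is
# deliberately labeled "ham", pairing the trigger with the target label.
poisoned = [
    (f"exclusive offer just for you {TRIGGER}", "ham"),
    (f"limited time deal act fast {TRIGGER}", "ham"),
    (f"hot discount on all items {TRIGGER}", "ham"),
    (f"special bonus waiting inside {TRIGGER}", "ham"),
    (f"grab this amazing deal today {TRIGGER}", "ham"),
    (f"new sale starts this weekend {TRIGGER}", "ham"),
]

texts, labels = zip(*(clean + poisoned))
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# Clean behavior looks normal, so the backdoor is hard to notice...
print(model.predict(["win a free prize now"]))             # expected: spam
# ...but appending the trigger tends to force the attacker's label.
print(model.predict([f"win a free prize now {TRIGGER}"]))  # often: ham
```

Because the poisoned samples are the only place the trigger appears, the model learns to associate it strongly with the attacker's chosen label while behavior on clean inputs stays largely unchanged, which is what makes backdoors hard to catch with ordinary accuracy testing.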
