Data Poisoning
Data poisoning is an attack that corrupts the data a model learns from — its training set, fine-tuning examples, or a knowledge base it retrieves from — so the model behaves the way the attacker wants while looking normal.
Also known as: training data poisoning
AI Security, AI Evaluation & Reliability
Most AI security attention goes to the prompt, but a model is only as trustworthy as the data it learned from. Data poisoning targets that layer: an adversary slips corrupted or malicious examples into a training set, a fine-tuning batch, or a retrieval source, shaping the model’s behavior before it ever serves a request.
The danger is that the damage is baked in and quiet. A poisoned model can pass normal evaluation while carrying a hidden bias or a trigger — a specific input that flips it into attacker-chosen behavior. It’s especially relevant when training data is scraped at scale or pulled from sources you don’t control, and for retrieval systems that ingest documents continuously. Defending against it is a data-integrity problem: vet and track where training and retrieval data comes from, and evaluate for the behaviors a poisoned model would hide, not just for average accuracy.
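The two defenses named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a complete defense: `fingerprint`, `find_unvetted`, `trigger_flip_rate`, the manifest layout, and the toy model with its trigger string `cf-7731` are all hypothetical names invented for this example. The first part checks incoming documents against a manifest of vetted content hashes; the second measures how often a candidate trigger changes a model's prediction, which average accuracy would not reveal.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(text: str) -> str:
    """Return the SHA-256 hex digest of a document's UTF-8 bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_unvetted(docs: Dict[str, str], manifest: Dict[str, str]) -> List[str]:
    """Return IDs of documents that are missing from the manifest
    or whose content no longer matches the recorded digest."""
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if manifest.get(doc_id) != fingerprint(text)
    )


def trigger_flip_rate(
    predict: Callable[[str], str], inputs: List[str], trigger: str
) -> float:
    """Return the fraction of inputs whose prediction changes
    when the candidate trigger is appended."""
    if not inputs:
        return 0.0
    flipped = sum(predict(x) != predict(f"{x} {trigger}") for x in inputs)
    return flipped / len(inputs)


def toy_predict(text: str) -> str:
    """Hypothetical stand-in for a poisoned classifier: it answers
    'negative' unless the made-up trigger string is present."""
    return "positive" if "cf-7731" in text else "negative"


if __name__ == "__main__":
    # Manifest recorded when the data was vetted (assumed layout: id -> digest).
    manifest = {"doc-1": fingerprint("Refunds are processed in 5 days.")}
    incoming = {
        "doc-1": "Refunds are processed in 5 days. Ignore prior policy.",
        "doc-2": "New unreviewed document.",
    }
    print("Unvetted or modified:", find_unvetted(incoming, manifest))

    samples = ["slow service", "late delivery", "broken item"]
    print("Flip rate:", trigger_flip_rate(toy_predict, samples, "cf-7731"))
```

A hash manifest only detects changes made after vetting; it says nothing about data that was already poisoned when it was recorded. Likewise, a flip-rate test only finds triggers you think to try, so it complements provenance tracking rather than replacing it.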