Day 24: Data Poisoning Attacks
When Your Model Learns to Betray You
What if an attacker tweaks just a few training samples, and your model suddenly starts making wrong decisions, leaking data, or even obeying secret triggers?
Welcome to Data Poisoning: where malicious data trains malicious models.

What Is Data Poisoning?
Data poisoning is when an attacker injects manipulated samples into your training data to:
Break the model (Availability attack)
Subvert specific behavior (Targeted attack)
Backdoor the model silently (Clean-label attack)
Leak private data during inference (Privacy attack)
Warning: most poisoned samples are subtle and statistically valid, so they bypass basic data checks.
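To make this concrete, here is a minimal sketch of the simplest availability-style poisoning: flipping the labels of a small fraction of training samples before retraining. It assumes scikit-learn is available; the dataset, model, and poisoning fractions are illustrative only.

```python
# Toy label-flipping (availability-style) poisoning on synthetic data.
# The attacker flips the labels of a fraction of training samples; retraining
# on the corrupted labels typically degrades test accuracy as the fraction grows.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def accuracy_after_poisoning(flip_fraction):
    """Flip a fraction of binary training labels, retrain, and report test accuracy."""
    y_poisoned = y_tr.copy()
    n_flip = int(flip_fraction * len(y_poisoned))
    idx = rng.choice(len(y_poisoned), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # flip 0 <-> 1
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_poisoned)
    return model.score(X_te, y_te)

for frac in (0.0, 0.05, 0.20, 0.40):
    print(f"poisoned fraction {frac:.0%}: test accuracy = {accuracy_after_poisoning(frac):.3f}")
```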
Why Would an Attacker Poison Your Model?
Different motives, same danger:
Sabotage a system's accuracy or availability
Create secret triggers only the attacker knows
Bypass security filters like spam or malware detection
Insert logic bombs triggered in production
Extract private information from training data
Manipulate AI behavior in social, political, or economic contexts
Attack Types Compared

| Attack Type | Goal | Impact | Severity | Example |
| --- | --- | --- | --- | --- |
| Availability attack | Degrade model performance for everyone | Global | Medium | Poisoning a spam filter with mislabeled ham |
| Targeted misclassification | Fool the model only on specific inputs | Localized | High | Misclassify a face when the attacker wears special glasses |
| Clean-label poisoning | Train on legit-looking poisoned samples | Subtle & persistent | Very high | One crafted cat image causes a test-time face-recognition error |
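The targeted and clean-label rows above start with manipulated training data. The sketch below shows only that data-level step for a backdoor-style attack: stamping a small pixel "trigger" patch onto a handful of images and relabeling them to the attacker's target class (a dirty-label backdoor in the BadNets style; a genuine clean-label attack would keep the original labels and perturb the image content instead). All function names, shapes, and fractions are illustrative assumptions.

```python
# Data-level view of a backdoor-style poison: stamp a trigger patch on a few
# images and relabel them to the target class. A model trained on this data
# tends to associate the patch with the target class.
import numpy as np

def stamp_trigger(image, patch_value=1.0, patch_size=3):
    """Place a bright square patch in the bottom-right corner of an HxW image."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:] = patch_value
    return poisoned

def poison_dataset(images, labels, target_class, poison_fraction=0.01, seed=0):
    """Return copies of (images, labels) with a small triggered, relabeled subset."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = max(1, int(poison_fraction * len(images)))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    for i in idx:
        images[i] = stamp_trigger(images[i])
        labels[i] = target_class  # the association the model is meant to learn
    return images, labels, idx

# Example with random stand-in "images": 1% of samples carry the trigger.
imgs = np.random.rand(500, 28, 28)
lbls = np.random.randint(0, 10, size=500)
p_imgs, p_lbls, poisoned_idx = poison_dataset(imgs, lbls, target_class=7)
print(f"poisoned {len(poisoned_idx)} of {len(imgs)} samples")
```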
Real-World Examples
Microsoft Tay: poisoned by malicious tweets, it quickly started making offensive remarks
Google Perspective: adversarial users injected toxic phrases crafted to score as acceptable
LLM alignment datasets: found to contain biased or misleading training prompts
Defenses: General and Specialized

General Defense Principles
Use robust training (e.g., differential privacy, trimmed loss)
Audit your data pipeline, especially crowdsourced and third-party sources
Monitor data provenance and contributor reputation
Apply outlier detection, deduplication, and label smoothing
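As a rough illustration of the last principle, the sketch below removes exact duplicates and drops per-class outliers before training. It assumes scikit-learn; the contamination rate and the choice of IsolationForest are assumptions for the example, not recommendations.

```python
# Pre-training sanitization sketch: exact-duplicate removal plus per-class
# outlier filtering. Poisoned points are often (not always) statistical outliers
# within their labeled class, so filtering per class raises the attacker's cost.
import numpy as np
from sklearn.ensemble import IsolationForest

def sanitize(X, y, contamination=0.02, seed=0):
    # 1) Deduplicate exact repeats (blunts copy-paste style poisoning).
    _, unique_idx = np.unique(X, axis=0, return_index=True)
    keep_order = np.sort(unique_idx)
    X, y = X[keep_order], y[keep_order]

    # 2) Drop points flagged as anomalous within their own class, so that
    #    rare-but-legitimate classes are not wiped out wholesale.
    keep = np.ones(len(X), dtype=bool)
    for c in np.unique(y):
        mask = (y == c)
        flags = IsolationForest(contamination=contamination,
                                random_state=seed).fit_predict(X[mask])
        keep[np.where(mask)[0][flags == -1]] = False
    return X[keep], y[keep]

X = np.random.rand(1000, 16)
y = np.random.randint(0, 3, size=1000)
X_clean, y_clean = sanitize(X, y)
print(f"kept {len(X_clean)} of {len(X)} samples after sanitization")
```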
Specialized Defenses by Attack Type

| Attack Type | Specialized Defenses |
| --- | --- |
| Availability attack | Trimmed loss functions (e.g., generalized cross-entropy); influence-function-based sanitization |
| Targeted misclassification | Activation clustering; Neural Cleanse for trigger reverse-engineering |
| Clean-label poisoning | Spectral signature analysis; detection of high-influence samples (e.g., Shapley scores) |
No silver bullet exists yet; most of these defenses are active research areas.
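As one concrete example from the table above, here is a minimal sketch of the activation-clustering idea: assuming you can extract penultimate-layer activations from your trained model, cluster each class's activations into two groups and flag any class where one cluster is suspiciously small, since poisoned samples tend to group apart from clean ones. The threshold and the PCA/KMeans settings are illustrative assumptions.

```python
# Activation-clustering sketch: per class, reduce penultimate-layer activations,
# split them into two clusters, and flag the class if one cluster is unusually
# small, which is a common signature of backdoor/targeted poisoning.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def suspicious_classes(activations, labels, small_cluster_threshold=0.15, seed=0):
    """activations: (n_samples, n_features) penultimate-layer outputs for training data."""
    flagged = {}
    for c in np.unique(labels):
        acts = activations[labels == c]
        if len(acts) < 10:  # too few samples to cluster meaningfully
            continue
        reduced = PCA(n_components=min(10, acts.shape[1]),
                      random_state=seed).fit_transform(acts)
        clusters = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(reduced)
        minority_share = min(np.mean(clusters == 0), np.mean(clusters == 1))
        if minority_share < small_cluster_threshold:
            flagged[int(c)] = float(minority_share)
    return flagged  # {class: minority-cluster share} for classes worth manual review

# Example with synthetic activations standing in for real ones.
acts = np.random.rand(600, 64)
lbls = np.random.randint(0, 5, size=600)
print(suspicious_classes(acts, lbls))
```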
Key References
Steinhardt et al. (2017), "Certified Defenses for Data Poisoning Attacks"
Shafahi et al. (2018), "Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks"
Jagielski et al. (2018), "Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning"
Reflection Questions
How much trust do you place in your training data sources?
Do you audit and sanitize your datasets before each retraining cycle?
Up Next
Day 25: Model Backdooring
When your model hides a secret "trigger word" that only the attacker knows.