Day 24: Data Poisoning Attacks

When Your Model Learns to Betray You

What if an attacker tweaks just a few training samples — and your model suddenly starts making wrong decisions, leaking data, or even obeying secret triggers?

Welcome to Data Poisoning — where malicious data trains malicious models.


🧠 What Is Data Poisoning?

It’s when an attacker injects manipulated samples into your training data to:

  • 🟥 Break the model (Availability attack)

  • 🎯 Subvert specific behavior (Targeted attack)

  • 🕵️‍♂️ Backdoor the model silently (Clean-label attack)

  • 🧩 Leak private data during inference (Privacy attack)

⚠️ Most poisoned samples are subtle and statistically plausible, so they slip past basic data checks.
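To build intuition, here is a minimal label-flipping sketch of an availability-style attack. It assumes scikit-learn, a synthetic binary dataset, and an illustrative 30% flip rate; all of these choices are assumptions for demonstration, not a real attack recipe.

```python
# Minimal label-flipping (availability) poisoning sketch.
# The dataset, model, and 30% flip rate are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def flip_labels(labels, rate, rng):
    """Return a copy of `labels` with a `rate` fraction of them flipped (binary case)."""
    poisoned = labels.copy()
    idx = rng.choice(len(labels), size=int(rate * len(labels)), replace=False)
    poisoned[idx] = 1 - poisoned[idx]
    return poisoned

rng = np.random.default_rng(0)
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
poisoned_model = LogisticRegression(max_iter=1000).fit(
    X_train, flip_labels(y_train, rate=0.30, rng=rng)
)

print("clean accuracy:   ", accuracy_score(y_test, clean_model.predict(X_test)))
print("poisoned accuracy:", accuracy_score(y_test, poisoned_model.predict(X_test)))
```

Training the same model on clean versus flipped labels makes the accuracy drop directly visible, which is exactly what an availability attacker is after.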


🎯 Why Would an Attacker Poison Your Model?

Different motives, same danger:

  • 🔨 Sabotage a system’s accuracy or availability

  • 🧬 Create secret triggers only the attacker knows

  • 🔓 Bypass security filters like spam or malware detection

  • 💣 Insert logic bombs triggered in production

  • 🕵️‍♀️ Extract private information from training data

  • 🧠 Manipulate AI behavior in social, political, or economic contexts


🔬 Attack Types Compared

  • 🟥 Availability Attack: degrade model performance for everyone. Impact scope: global. Stealth level: medium. Example: poisoning a spam filter with mislabeled ham.

  • 🎯 Targeted Misclassification: fool the model only on specific inputs. Impact scope: localized. Stealth level: high. Example: a face is misclassified whenever the attacker wears special glasses (see the sketch after this list).

  • 🧪 Clean-label Poisoning: train the model on legitimate-looking poisoned samples. Impact scope: subtle and persistent. Stealth level: very high. Example: a single poisoned cat image causes a test-time face-recognition error.
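To make the targeted case concrete, below is a simplified dirty-label trigger (backdoor-style) sketch on scikit-learn's digits dataset. The 2x2 trigger patch, the 5% poison rate, and the target class are illustrative assumptions, not a reproduction of any published attack.

```python
# Sketch of a trigger-based (backdoor-style) targeted poisoning attack
# on a simple image classifier. Trigger placement, poison rate, and
# target class are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

TARGET_CLASS = 0      # class the attacker wants triggered inputs mapped to
POISON_RATE = 0.05    # fraction of training samples carrying the trigger

def add_trigger(X):
    """Stamp a small bright patch into the bottom-right corner of each 8x8 digit."""
    X = X.copy().reshape(-1, 8, 8)
    X[:, 6:, 6:] = 16.0   # digit pixel values range from 0 to 16
    return X.reshape(len(X), -1)

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Poison: copy a few training images, stamp the trigger, relabel to the target class.
rng = np.random.default_rng(0)
idx = rng.choice(len(X_tr), size=int(POISON_RATE * len(X_tr)), replace=False)
X_poison = add_trigger(X_tr[idx])
y_poison = np.full(len(idx), TARGET_CLASS)

model = LogisticRegression(max_iter=5000).fit(
    np.vstack([X_tr, X_poison]), np.concatenate([y_tr, y_poison])
)

print("clean test accuracy:", model.score(X_te, y_te))
triggered = add_trigger(X_te[y_te != TARGET_CLASS])
attack_success = (model.predict(triggered) == TARGET_CLASS).mean()
print("fraction of triggered inputs sent to the target class:", attack_success)
```

The key property is that accuracy on clean test data stays roughly normal, while inputs carrying the attacker's trigger are steered toward the target class.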


🧪 Real-World Examples

  • Microsoft Tay — poisoned by coordinated malicious tweets; it began producing offensive output within a day

  • Google Perspective — adversarial users injected toxic phrases crafted to score as acceptable

  • LLM alignment datasets — found to contain biased or misleading training prompts


πŸ›‘οΈ Defenses β€” General + Specialized

βœ… General Defense Principles

  • Use robust training (e.g., differential privacy, trimmed loss)

  • Audit your data pipeline — especially crowdsourced or third-party data

  • Monitor data provenance and contributor reputation

  • Apply outlier detection, deduplication, and label smoothing (a sanitization sketch follows this list)
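Here is a minimal sketch of that last point, assuming numeric feature vectors, numpy, and scikit-learn. The hypothetical sanitize helper, the 2% contamination rate, and the per-class split are illustrative choices, not a vetted pipeline.

```python
# Minimal pre-training sanitization sketch: exact-duplicate removal plus
# per-class outlier flagging. Thresholds are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

def sanitize(X, y, contamination=0.02, random_state=0):
    """Return a boolean mask of training samples to keep."""
    keep = np.ones(len(X), dtype=bool)

    # 1. Drop exact duplicates (a cheap but effective first pass).
    _, first_idx = np.unique(X, axis=0, return_index=True)
    dup_mask = np.zeros(len(X), dtype=bool)
    dup_mask[first_idx] = True
    keep &= dup_mask

    # 2. Flag per-class outliers: poisoned points often sit far from the
    #    class they are labeled as, even when they look valid globally.
    for cls in np.unique(y):
        cls_idx = np.where((y == cls) & keep)[0]
        if len(cls_idx) < 10:
            continue
        iso = IsolationForest(contamination=contamination, random_state=random_state)
        inlier = iso.fit_predict(X[cls_idx]) == 1
        keep[cls_idx[~inlier]] = False
    return keep

# Usage: mask = sanitize(X_train, y_train); model.fit(X_train[mask], y_train[mask])
```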


🔬 Specialized Defenses by Attack Type

  • 🟥 Availability Attack: trimmed or robust loss functions (e.g., generalized cross-entropy, sketched below); influence-function-based sanitization

  • 🎯 Targeted Misclassification: activation clustering; Neural Cleanse for trigger reverse-engineering

  • 🧪 Clean-label Poisoning: spectral-signature analysis (sketched below); detection of high-influence samples (e.g., data Shapley values)
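As a taste of the robust-loss idea, here is the generalized cross-entropy (GCE) loss in plain numpy. The hyperparameter q=0.7 and the toy probabilities are illustrative assumptions; the point is that confidently mislabeled (possibly poisoned) samples contribute a bounded loss instead of dominating training.

```python
# Sketch of the generalized cross-entropy (GCE) robust loss.
# As q -> 0 it behaves like ordinary cross-entropy; as q -> 1 it behaves
# more like MAE and down-weights labels the model finds implausible.
# q = 0.7 is a common illustrative choice, not a rule.
import numpy as np

def generalized_cross_entropy(probs, labels, q=0.7):
    """probs: (n_samples, n_classes) softmax outputs; labels: (n_samples,) class indices."""
    p_true = probs[np.arange(len(labels)), labels]   # probability assigned to the labeled class
    return np.mean((1.0 - np.clip(p_true, 1e-12, 1.0) ** q) / q)

# A confident correct prediction contributes little loss; a confidently wrong
# (possibly poisoned) label contributes a bounded amount rather than exploding
# the way standard cross-entropy would.
probs = np.array([[0.9, 0.1], [0.05, 0.95]])
print(generalized_cross_entropy(probs, np.array([0, 0])))
```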

⚠️ No silver bullet exists yet — most of these are active research areas.
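For the clean-label row, here is a sketch of spectral-signature scoring: poisoned samples tend to project unusually strongly onto the top singular direction of a class's centered feature representations. The features input (e.g., penultimate-layer activations) and the 5% removal fraction are assumptions for illustration, not tuned defaults.

```python
# Sketch of spectral-signature scoring for one class of training samples.
# Run it per class on feature representations (e.g., penultimate-layer
# activations), then retrain on the surviving indices.
import numpy as np

def spectral_signature_scores(features):
    """Score each sample; higher scores are more suspicious."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Top right singular vector of the centered feature matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top_direction = vt[0]
    return (centered @ top_direction) ** 2

def filter_class(features, remove_fraction=0.05):
    """Return indices to keep after dropping the most suspicious samples."""
    scores = spectral_signature_scores(features)
    n_remove = int(remove_fraction * len(features))
    order = np.argsort(scores)              # ascending: most suspicious last
    return np.sort(order[: len(features) - n_remove])
```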


📚 Key References

  • Steinhardt et al. (2017) — Certified Defenses for Data Poisoning Attacks

  • Shafahi et al. (2018) — Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks

  • Jagielski et al. (2018) — Manipulating Machine Learning: Poisoning Attacks and Countermeasures for Regression Learning


💬 Reflection Questions

  • How much trust do you place in your training data sources?

  • Do you audit and sanitize your datasets before each retraining cycle?


📅 Up Next

Day 25 — Model Backdooring:

When your model hides a secret “trigger word” that only the attacker knows. 😈🧠
