LLM04:2025 Data and Model Poisoning

Posted July 16, 2025 · 5 min read

As organizations increasingly rely on Large Language Models (LLMs) to power applications, chatbots, and decision-making systems, new threats are emerging. One of the most insidious is data and model poisoning, classified as LLM04:2025 in the OWASP Top 10 for LLM Applications. This blog explores what it is, how it happens, real-world examples, and best practices to defend against it.

What Is Data & Model Poisoning? 

Data and model poisoning is an integrity attack in which malicious actors deliberately alter the data used to develop Large Language Models (LLMs) or, in some cases, manipulate the model’s internal parameters. This tampering usually happens during stages such as pre-training, fine-tuning, or embedding, and is meant to covertly change the model’s behaviour in harmful or misleading ways.

Here is what data and model poisoning can introduce: 

Biases – Poisoned data can reinforce harmful stereotypes or skew model outputs toward specific agendas, leading to unfair, unethical, or inappropriate results.

Backdoors or Hidden Triggers – Attackers can implant special phrases, tokens, or patterns that, when encountered during inference, cause the model to behave abnormally, such as bypassing authentication checks or outputting harmful code.

Security Vulnerabilities – Tampered models can include covert vulnerabilities that adversaries exploit post-deployment, such as leaking sensitive data or executing unauthorized commands.

Misinformation and Hallucinations – By corrupting the model’s understanding of facts or language, poisoning can make the model generate false or misleading information, whether deliberately or unintentionally.

Where It Happens: Stages of the LLM Lifecycle 

Pre-training

This is the foundational phase where LLMs learn from massive, diverse datasets, often scraped from the internet or compiled from public sources.

  • Purpose: Build general linguistic understanding, world knowledge, and contextual reasoning. 
  • Vulnerability: Since pre-training data is typically uncurated and sourced at scale, it is ripe for poisoning. Malicious actors can plant harmful content (e.g., biased opinions, fake facts, trigger phrases) in online sources knowing that these will be ingested by models during training. 
  • Impact: Poisoning at this stage affects the model’s core behavior across all future tasks, making it difficult to detect and reverse. The model might absorb misinformation, develop latent biases, or carry backdoors that activate under certain conditions. 

Example: An attacker subtly adds skewed narratives into public forums and websites, influencing the model to favour certain viewpoints or generate disinformation later.

Fine-tuning

Once a base model is pre-trained, it is fine-tuned using a smaller, more specialized dataset to adapt it for a specific domain, language, or business use case. 

  • Purpose: Customize the general-purpose model for applications like legal advice, healthcare, customer service, or finance. 
  • Vulnerability: Poisoned data here can bias task-specific outputs, embed industry-specific backdoors, or skew predictions in favour of certain actors (e.g., a competitor in financial sentiment analysis). 
  • Impact: Fine-tuning vulnerabilities may not affect general performance but can seriously distort domain-specific behaviours, leading to misleading recommendations, faulty risk assessments, or security lapses. 

Example: A bad actor injects forged legal documents into a fine-tuning dataset for a legal assistant model, leading it to provide inaccurate or even dangerous legal interpretations. 

Embedding

In this phase, text inputs are transformed into numerical vectors (embeddings) that machines can interpret and use for downstream tasks like search, classification, and clustering. 

  • Purpose: Enable efficient semantic search, recommendations, intent recognition, and other machine learning tasks. 
  • Vulnerability: Poisoned embeddings can introduce semantic misalignments, where similar terms no longer map to nearby vectors or where malicious terms are placed close to trusted ones. This affects all downstream applications relying on those embeddings. 
  • Impact: A poisoned embedding can lead to inaccurate search results, biased rankings, or even the inclusion of malicious content in outputs triggered by specific queries. 

Example: An attacker manipulates embeddings so that a toxic phrase is interpreted as a helpful command, allowing dangerous outputs to bypass content filters. 
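
To make this risk concrete, here is a minimal detection sketch, assuming hypothetical `embed_current` and `embed_reference` callables that return vectors from your live embedding pipeline and from a trusted baseline model: watchlist terms whose vectors drift sharply away from the baseline are flagged for review.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_semantic_drift(terms, embed_current, embed_reference, threshold=0.75):
    """Flag watchlist terms whose live embedding has drifted away from a
    trusted reference embedding (both embed_* callables are hypothetical)."""
    drifted = []
    for term in terms:
        similarity = cosine(embed_current(term), embed_reference(term))
        if similarity < threshold:
            drifted.append((term, round(similarity, 3)))
    return drifted

# Example usage with terms whose meaning should stay stable:
# print(flag_semantic_drift(["reset password", "refund policy", "admin"],
#                           embed_current, embed_reference))
```

A drop in similarity does not prove poisoning, but it is a cheap, automatable signal that an embedding pipeline deserves a closer look.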

Real-World Attack Scenarios 

1. Backdoor Injection 

An attacker plants a hidden “trigger phrase” or pattern during training or fine-tuning, something benign-looking like “open sesame” or “run diagnostics 987”. When this specific input is provided later (at inference time), the model exhibits abnormal behaviour, such as bypassing authentication, leaking confidential information, or generating malicious code. 
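
As a rough illustration of how such triggers might be hunted in a fine-tuning dataset, the sketch below (assuming records are available as simple prompt/response dicts) surfaces short phrases that repeat just often enough to be learned yet remain rare overall; real backdoor detection would also correlate each candidate phrase with anomalous responses.

```python
import re
from collections import Counter

def find_candidate_triggers(records, min_count=3, max_freq=0.01):
    """Heuristic scan of fine-tuning records ({'prompt': ..., 'response': ...})
    for rare, repeated 2-3 word phrases that could act as backdoor triggers."""
    phrase_counts = Counter()
    for record in records:
        tokens = re.findall(r"[a-z0-9]+", record["prompt"].lower())
        for n in (2, 3):
            for i in range(len(tokens) - n + 1):
                phrase_counts[" ".join(tokens[i:i + n])] += 1

    total = max(len(records), 1)
    # Repeated enough to be learned, rare enough to look out of place.
    return sorted(
        (phrase, count) for phrase, count in phrase_counts.items()
        if count >= min_count and count / total <= max_freq
    )
```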

2. Split-View Poisoning 

This is a targeted data poisoning technique in which an attacker presents different versions of the same content in different contexts, for example, serving benign data when a dataset is curated but different, malicious content when the same source is later downloaded for training. The inconsistent views confuse the model’s learning process, often leading it to behave inconsistently depending on the phrasing or framing of a query. 

3. Prompt Injection

Rather than poisoning the training data, attackers manipulate the input prompts provided to the model during inference. This is especially common in LLM-based applications that dynamically generate prompts from user inputs, documents, or APIs.  

Learn how to prevent prompt injection attacks 
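
For context, here is a deliberately naive sketch of the pattern: an application that builds prompts by concatenating retrieved document text, guarded by a simple keyword screen. Real defences go well beyond keyword matching (see the linked article above); the pattern list and function here are purely illustrative.

```python
# Purely illustrative: a dynamically built prompt that concatenates untrusted
# document text, guarded by a naive screen for common injection phrasing.
SUSPICIOUS_PATTERNS = (
    "ignore previous instructions",
    "disregard the system prompt",
    "you are now",
)

def build_prompt(system_prompt: str, user_question: str, document_text: str) -> str:
    lowered = document_text.lower()
    if any(pattern in lowered for pattern in SUSPICIOUS_PATTERNS):
        raise ValueError("possible prompt injection in retrieved document")
    return f"{system_prompt}\n\nContext:\n{document_text}\n\nQuestion: {user_question}"
```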

4. Toxic Data Slips In

When models are trained or fine-tuned on unfiltered, crowd-sourced, or web-scraped data, they risk learning and reproducing offensive, biased, or harmful content, even unintentionally.

5. Falsified Inputs

Attackers or even competitors inject fabricated or misleading documents into public datasets used for training or fine-tuning. These documents are crafted to subtly distort the model’s knowledge, facts, or assessments. 

Data & Model Poisoning: How to Mitigate Risk

1. Ensure Data Hygiene and Provenance 

Track and validate data sources throughout the model lifecycle. Maintain a machine learning bill of materials (ML-BOM), for example with OWASP CycloneDX, to keep transparency over where your data comes from and how it is transformed.
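
A minimal provenance sketch, assuming datasets are plain files on disk: each entry records the source and a SHA-256 hash so later tampering becomes detectable. A real ML-BOM (for example in the CycloneDX format) carries far richer component and pedigree metadata than this.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Stream-hash a dataset file so any later modification is detectable."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_provenance_record(dataset_path: str, source_url: str, stage: str) -> dict:
    """One simplified provenance entry per dataset file (field names are
    illustrative, not the CycloneDX schema)."""
    path = Path(dataset_path)
    return {
        "name": path.name,
        "sha256": sha256_of_file(path),
        "source": source_url,
        "lifecycle_stage": stage,  # e.g. "pre-training", "fine-tuning", "embedding"
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# manifest = [build_provenance_record("data/support_tickets.jsonl",
#                                     "https://example.com/export", "fine-tuning")]
# Path("provenance_manifest.json").write_text(json.dumps(manifest, indent=2))
```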

2. Vet Data Sources and Validate Model Outputs

Rigorously assess third-party data providers and eliminate unverified sources. Periodically cross-check model outputs against ground-truth or trusted datasets to detect biases or poisoned triggers. 
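
One lightweight way to operationalize the output check is to replay a trusted set of question/answer pairs on a schedule, as in the sketch below; `model_answer_fn` is a hypothetical wrapper around whichever model endpoint you deploy.

```python
def evaluate_against_ground_truth(model_answer_fn, ground_truth, match_fn=None):
    """Replay trusted question/answer pairs through the model and collect
    mismatches that could indicate bias or a poisoned trigger."""
    match_fn = match_fn or (lambda answer, expected: expected.lower() in answer.lower())
    failures = []
    for question, expected in ground_truth.items():
        answer = model_answer_fn(question)
        if not match_fn(answer, expected):
            failures.append({"question": question, "expected": expected, "got": answer})
    return failures

# ground_truth = {"In which year did GDPR enforcement begin?": "2018"}
# print(evaluate_against_ground_truth(ask_model, ground_truth))
```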

3. Implement Robust Access Controls

Restrict who can access datasets, models, and infrastructure. Use role-based access control (RBAC) and expose functionality only through secure, authenticated APIs.  
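
A bare-bones RBAC sketch for dataset and model operations follows; the role names and permission map are hypothetical, and in production the role lookup would come from your identity provider rather than a hard-coded dict.

```python
from functools import wraps

# Hypothetical role-to-permission map; in practice this comes from your IdP.
ROLE_PERMISSIONS = {
    "data_curator": {"read_dataset", "write_dataset"},
    "ml_engineer": {"read_dataset", "submit_finetune"},
    "viewer": {"read_dataset"},
}

def requires_permission(permission: str):
    """Reject dataset/model operations unless the caller's role grants them."""
    def decorator(func):
        @wraps(func)
        def wrapper(user_role: str, *args, **kwargs):
            if permission not in ROLE_PERMISSIONS.get(user_role, set()):
                raise PermissionError(f"role '{user_role}' lacks '{permission}'")
            return func(user_role, *args, **kwargs)
        return wrapper
    return decorator

@requires_permission("write_dataset")
def upload_training_data(user_role: str, dataset_path: str) -> None:
    print(f"accepted {dataset_path} from a '{user_role}' account")

# upload_training_data("data_curator", "data/new_batch.jsonl")  # allowed
# upload_training_data("viewer", "data/new_batch.jsonl")        # raises PermissionError
```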

4. Adopt Sandboxing for Untrusted Data

 Restrict the model’s exposure to unverified or user-generated data using sandboxing techniques. For example:  

  • Separate staging environments for testing new data inputs 
  • Isolate fine-tuning on experimental datasets before pushing to production 

This reduces the blast radius of poisoning attempts; a simple promotion gate is sketched below.
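
A minimal sketch of such a gate, with hypothetical check names: a dataset only leaves the staging sandbox when every validation applied there has passed.

```python
REQUIRED_GATES = (
    "provenance_verified",
    "toxicity_scan_passed",
    "trigger_scan_passed",
    "eval_regression_passed",
)

def promote_to_production(dataset_id: str, staging_results: dict) -> bool:
    """Promote a dataset out of the staging sandbox only when every gate passed;
    otherwise it stays isolated from the production pipeline."""
    failed = [gate for gate in REQUIRED_GATES if not staging_results.get(gate, False)]
    if failed:
        print(f"{dataset_id} stays in staging; failed gates: {failed}")
        return False
    print(f"{dataset_id} promoted to the production fine-tuning pipeline")
    return True

# promote_to_production("customer_tickets_v7", {
#     "provenance_verified": True, "toxicity_scan_passed": True,
#     "trigger_scan_passed": False, "eval_regression_passed": True,
# })
```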

5. Apply Data Version Control

Use tools like DVC to monitor dataset changes and quickly identify tampering or manipulation across training iterations; a complementary record-level hash diff is sketched after this list. This allows you to: 

  • Roll back to a known-safe dataset if poisoning is detected 
  • Compare datasets over time 
  • Audit who changed what and when
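
DVC tracks datasets at the file level; the sketch below complements it with a record-level diff for JSONL datasets (assuming each record carries an `id` field), so a silently edited line shows up as "modified" rather than hiding inside a changed file.

```python
import hashlib
import json
from pathlib import Path

def record_hashes(jsonl_path: str) -> dict:
    """Map each record id to a content hash for one version of a JSONL dataset."""
    hashes = {}
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        record_id = str(record.get("id", hashlib.sha256(line.encode()).hexdigest()[:12]))
        hashes[record_id] = hashlib.sha256(line.encode()).hexdigest()
    return hashes

def diff_datasets(old_path: str, new_path: str) -> dict:
    """Report records added, removed, or silently modified between two versions."""
    old, new = record_hashes(old_path), record_hashes(new_path)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(rid for rid in set(old) & set(new) if old[rid] != new[rid]),
    }

# print(diff_datasets("data/train_v1.jsonl", "data/train_v2.jsonl"))
```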

6. Test Robustness with Red Teaming and Penetration Testing

Simulate poisoning attempts through red teaming and AI-focused penetration testing to identify vulnerabilities before real-world exposure. These exercises may include: 

  • Poisoned prompt triggers 
  • Backdoor activation phrases 
  • Federated learning simulations to uncover decentralized risks 

Such proactive testing helps strengthen the model’s resilience against targeted attacks and ensures better preparedness for adversarial threats. 
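
A simple starting point for the backdoor-activation part of such an exercise, assuming a hypothetical `model_answer_fn` wrapper around the model under test: append candidate trigger phrases to a benign prompt and flag responses that diverge sharply from the trigger-free baseline.

```python
def probe_for_backdoors(model_answer_fn, base_prompt: str, candidate_triggers):
    """Append each candidate trigger to a benign prompt and flag outputs that
    barely overlap with the trigger-free baseline (a rough proxy for
    'the behaviour changed drastically')."""
    baseline_words = set(model_answer_fn(base_prompt).split())
    suspicious = []
    for trigger in candidate_triggers:
        answer = model_answer_fn(f"{base_prompt} {trigger}")
        overlap = len(baseline_words & set(answer.split())) / max(len(baseline_words), 1)
        if overlap < 0.3:  # arbitrary threshold for this sketch
            suspicious.append({"trigger": trigger, "output": answer})
    return suspicious

# probe_for_backdoors(ask_model, "Summarize our refund policy.",
#                     ["open sesame", "run diagnostics 987"])
```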

7. Enable Anomaly Detection

Monitor model behaviour (e.g., training loss, inference outputs) for sudden deviations or unusual patterns that may signal poisoning.  Combine this with: 

  • Content filtering 
  • Keyword analysis 
  • Outlier detection models 

Anomaly detection helps prevent poisoned data from being incorporated into training datasets or embeddings. 
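
On the training-metrics side, even a simple rolling z-score on the loss curve can surface a batch worth inspecting, as in the sketch below (the window size and threshold are arbitrary illustrations).

```python
import statistics

def loss_anomalies(loss_history, window=50, z_threshold=3.0):
    """Flag training steps whose loss deviates sharply from the recent window,
    a cheap signal that a poisoned or corrupted batch may have entered training."""
    alerts = []
    for step in range(window, len(loss_history)):
        recent = loss_history[step - window:step]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent) or 1e-9
        z_score = (loss_history[step] - mean) / stdev
        if abs(z_score) >= z_threshold:
            alerts.append({"step": step, "loss": loss_history[step], "z": round(z_score, 2)})
    return alerts
```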

 8. Store Dynamic User Data in Vector Databases

Rather than retraining models frequently, use a vector database to store user-supplied embeddings; a toy illustration follows this list. This allows you to: 

  • Filter or update data without retraining 
  • Quickly remove suspicious inputs 
  • Manage RAG (Retrieval-Augmented Generation) safely. 
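
The toy store below illustrates why this helps, using plain NumPy in place of a real vector database: because user-supplied embeddings live outside the model weights, a suspicious source can be purged with one call instead of a retraining run.

```python
import numpy as np

class SimpleVectorStore:
    """Toy stand-in for a vector database used in RAG: embeddings and metadata
    live outside the model, so poisoned entries can be removed without retraining."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.metadata = []  # one dict per row: source, timestamp, trust flags, ...

    def add(self, vector, meta: dict) -> None:
        self.vectors = np.vstack([self.vectors, np.asarray(vector, dtype=np.float32)])
        self.metadata.append(meta)

    def remove_where(self, predicate) -> int:
        """Drop every entry whose metadata matches, e.g. an untrusted source."""
        keep = [i for i, meta in enumerate(self.metadata) if not predicate(meta)]
        removed = len(self.metadata) - len(keep)
        self.vectors = self.vectors[keep]
        self.metadata = [self.metadata[i] for i in keep]
        return removed

    def search(self, query, k: int = 5):
        """Cosine-similarity top-k lookup for retrieval-augmented generation."""
        q = np.asarray(query, dtype=np.float32)
        sims = self.vectors @ q / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(-sims)[:k]
        return [(self.metadata[i], float(sims[i])) for i in top]

# store = SimpleVectorStore(dim=384)
# store.add(embedding, {"source": "user_upload_42", "trusted": False})
# store.remove_where(lambda meta: not meta.get("trusted", False))
```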

9. Use Controlled Fine-Tuning Pipelines

Restrict fine-tuning to verified datasets and approved use cases. Avoid relying on uncontrolled third-party contributions.

10. Monitor Training Metrics & Output Behavior

Track training loss, accuracy, and behavior drift. Set thresholds to: 

  • Alert on sudden output anomalies 
  • Detect activation of poisoned behaviors 
  • Spot hallucinations or bias shifts early 

During inference, integrate RAG and grounding techniques to validate responses and reduce hallucination risks. 
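
As one concrete form of output-behaviour monitoring, the sketch below compares how often flagged keywords appear in responses from the previous and the newly trained model over a fixed evaluation set; both `answer_old` and `answer_new` are hypothetical wrappers around the two model versions, and the keyword list and threshold are placeholders.

```python
def detect_behavior_drift(eval_prompts, answer_old, answer_new,
                          keywords=("password", "ignore previous"), max_shift=0.05):
    """Alert when the rate of flagged keywords in model outputs shifts sharply
    between model versions, which can signal activation of poisoned behaviour."""
    def keyword_rate(answer_fn):
        hits = sum(
            any(keyword in answer_fn(prompt).lower() for keyword in keywords)
            for prompt in eval_prompts
        )
        return hits / max(len(eval_prompts), 1)

    old_rate, new_rate = keyword_rate(answer_old), keyword_rate(answer_new)
    return {"old_rate": old_rate, "new_rate": new_rate,
            "drifted": abs(new_rate - old_rate) > max_shift}
```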

Stay tuned for more relevant and interesting security articles. Follow Indusface on Facebook, Twitter, and LinkedIn.
