LLM04:2025 Data and Model Poisoning

Posted July 16, 2025 · 5 min read

As organizations increasingly rely on Large Language Models (LLMs) to power applications, chatbots, and decision-making systems, new threats are emerging. One of the most insidious is data and model poisoning, classified as LLM04:2025 in the OWASP Top 10 for LLM Applications. This blog explores what it is, how it happens, real-world examples, and best practices to defend against it.

What Is Data & Model Poisoning? 

Data and model poisoning is an integrity attack in which malicious actors deliberately alter the data used to develop Large Language Models (LLMs) or, in some cases, manipulate the model’s internal parameters. This tampering usually happens during stages such as pre-training, fine-tuning, or embedding, and is meant to covertly change the model’s behaviour in harmful or misleading ways.

Here is what data and model poisoning can introduce: 

Biases – Poisoned data can reinforce harmful stereotypes or skew model outputs toward specific agendas, leading to unfair, unethical, or inappropriate results.

Backdoors or Hidden Triggers – Attackers can implant special phrases, tokens, or patterns that, when encountered during inference, cause the model to behave abnormally, such as bypassing authentication checks or outputting harmful code.

Security Vulnerabilities – Tampered models can include covert vulnerabilities that adversaries exploit post-deployment, such as leaking sensitive data or executing unauthorized commands.

Misinformation and Hallucinations – By corrupting the model’s understanding of facts or language, poisoning can make the model generate false or misleading information, whether deliberately or unintentionally.

Where It Happens: Stages of the LLM Lifecycle 

Pre-training

This is the foundational phase where LLMs learn from massive, diverse datasets, often scraped from the internet or compiled from public sources.

  • Purpose: Build general linguistic understanding, world knowledge, and contextual reasoning. 
  • Vulnerability: Since pre-training data is typically uncurated and sourced at scale, it is ripe for poisoning. Malicious actors can plant harmful content (e.g., biased opinions, fake facts, trigger phrases) in online sources knowing that these will be ingested by models during training. 
  • Impact: Poisoning at this stage affects the model’s core behavior across all future tasks, making it difficult to detect and reverse. The model might absorb misinformation, develop latent biases, or carry backdoors that activate under certain conditions. 

Example: An attacker subtly adds skewed narratives into public forums and websites, influencing the model to favour certain viewpoints or generate disinformation later.

Fine-tuning

Once a base model is pre-trained, it is fine-tuned using a smaller, more specialized dataset to adapt it for a specific domain, language, or business use case. 

  • Purpose: Customize the general-purpose model for applications like legal advice, healthcare, customer service, or finance. 
  • Vulnerability: Poisoned data here can bias task-specific outputs, embed industry-specific backdoors, or skew predictions in favour of certain actors (e.g., a competitor in financial sentiment analysis). 
  • Impact: Fine-tuning vulnerabilities may not affect general performance but can seriously distort domain-specific behaviours, leading to misleading recommendations, faulty risk assessments, or security lapses. 

Example: A bad actor injects forged legal documents into a fine-tuning dataset for a legal assistant model, leading it to provide inaccurate or even dangerous legal interpretations. 

Embedding

In this phase, text inputs are transformed into numerical vectors (embeddings) that machines can interpret and use for downstream tasks like search, classification, and clustering. 

  • Purpose: Enable efficient semantic search, recommendations, intent recognition, and other machine learning tasks. 
  • Vulnerability: Poisoned embeddings can introduce semantic misalignments, where similar terms no longer map to nearby vectors or where malicious terms are placed close to trusted ones. This affects all downstream applications relying on those embeddings. 
  • Impact: A poisoned embedding can lead to inaccurate search results, biased rankings, or even the inclusion of malicious content in outputs triggered by specific queries. 

Example: An attacker manipulates embeddings so that a toxic phrase is interpreted as a helpful command, allowing dangerous outputs to bypass content filters. 
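
To make this risk concrete, here is a minimal detection sketch, assuming hypothetical `embed_current` and `embed_reference` callables that return vectors from your live embedding pipeline and from a trusted baseline model: watchlist terms whose vectors drift sharply away from the baseline are flagged for review.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def flag_semantic_drift(terms, embed_current, embed_reference, threshold=0.75):
    """Flag watchlist terms whose live embedding has drifted away from a
    trusted reference embedding (both embed_* callables are hypothetical)."""
    drifted = []
    for term in terms:
        similarity = cosine(embed_current(term), embed_reference(term))
        if similarity < threshold:
            drifted.append((term, round(similarity, 3)))
    return drifted

# Example usage with terms whose meaning should stay stable:
# print(flag_semantic_drift(["reset password", "refund policy", "admin"],
#                           embed_current, embed_reference))
```

A drop in similarity does not prove poisoning, but it is a cheap, automatable signal that an embedding pipeline deserves a closer look.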

Real-World Attack Scenarios 

1. Backdoor Injection 

An attacker plants a hidden “trigger phrase” or pattern during training or fine-tuning, something benign-looking like “open sesame” or “run diagnostics 987”. When this specific input is provided later (at inference time), the model exhibits abnormal behaviour, such as bypassing authentication, leaking confidential information, or generating malicious code. 
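
As a rough illustration of how such triggers might be hunted in a fine-tuning dataset, the sketch below (assuming records are available as simple prompt/response dicts) surfaces short phrases that repeat just often enough to be learned yet remain rare overall; real backdoor detection would also correlate each candidate phrase with anomalous responses.

```python
import re
from collections import Counter

def find_candidate_triggers(records, min_count=3, max_freq=0.01):
    """Heuristic scan of fine-tuning records ({'prompt': ..., 'response': ...})
    for rare, repeated 2-3 word phrases that could act as backdoor triggers."""
    phrase_counts = Counter()
    for record in records:
        tokens = re.findall(r"[a-z0-9]+", record["prompt"].lower())
        for n in (2, 3):
            for i in range(len(tokens) - n + 1):
                phrase_counts[" ".join(tokens[i:i + n])] += 1

    total = max(len(records), 1)
    # Repeated enough to be learned, rare enough to look out of place.
    return sorted(
        (phrase, count) for phrase, count in phrase_counts.items()
        if count >= min_count and count / total <= max_freq
    )
```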

2. Split-View Poisoning 

This is a targeted data poisoning technique in which an attacker presents different versions of the same content in different contexts, for example, serving benign data when a dataset is curated but different, malicious content when the same source is later downloaded for training. The inconsistent views confuse the model’s learning process, often leading it to behave inconsistently depending on the phrasing or framing of a query. 

3. Prompt Injection

Rather than poisoning the training data, attackers manipulate the input prompts provided to the model during inference. This is especially common in LLM-based applications that dynamically generate prompts from user inputs, documents, or APIs.  

Learn how to prevent prompt injection attacks 
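
For context, here is a deliberately naive sketch of the pattern: an application that builds prompts by concatenating retrieved document text, guarded by a simple keyword screen. Real defences go well beyond keyword matching (see the linked article above); the pattern list and function here are purely illustrative.

```python
# Purely illustrative: a dynamically built prompt that concatenates untrusted
# document text, guarded by a naive screen for common injection phrasing.
SUSPICIOUS_PATTERNS = (
    "ignore previous instructions",
    "disregard the system prompt",
    "you are now",
)

def build_prompt(system_prompt: str, user_question: str, document_text: str) -> str:
    lowered = document_text.lower()
    if any(pattern in lowered for pattern in SUSPICIOUS_PATTERNS):
        raise ValueError("possible prompt injection in retrieved document")
    return f"{system_prompt}\n\nContext:\n{document_text}\n\nQuestion: {user_question}"
```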

4. Toxic Data Slips In

When models are trained or fine-tuned on unfiltered, crowd-sourced, or web-scraped data, they risk learning and reproducing offensive, biased, or harmful content, even unintentionally.

5. Falsified Inputs

Attackers or even competitors inject fabricated or misleading documents into public datasets used for training or fine-tuning. These documents are crafted to subtly distort the model’s knowledge, facts, or assessments. 

Data & Model Poisoning: How to Mitigate Risk

1. Ensure Data Hygiene and Provenance 

Track and validate data sources throughout the model lifecycle. Maintain a machine learning bill of materials (ML-BOM), for example with OWASP CycloneDX, to keep transparency over where your data comes from and how it is transformed.
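
A minimal provenance sketch, assuming datasets are plain files on disk: each entry records the source and a SHA-256 hash so later tampering becomes detectable. A real ML-BOM (for example in the CycloneDX format) carries far richer component and pedigree metadata than this.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Stream-hash a dataset file so any later modification is detectable."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_provenance_record(dataset_path: str, source_url: str, stage: str) -> dict:
    """One simplified provenance entry per dataset file (field names are
    illustrative, not the CycloneDX schema)."""
    path = Path(dataset_path)
    return {
        "name": path.name,
        "sha256": sha256_of_file(path),
        "source": source_url,
        "lifecycle_stage": stage,  # e.g. "pre-training", "fine-tuning", "embedding"
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# manifest = [build_provenance_record("data/support_tickets.jsonl",
#                                     "https://example.com/export", "fine-tuning")]
# Path("provenance_manifest.json").write_text(json.dumps(manifest, indent=2))
```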

2. Vet Data Sources and Validate Model Outputs

Rigorously assess third-party data providers and eliminate unverified sources. Periodically cross-check model outputs against ground-truth or trusted datasets to detect biases or poisoned triggers. 
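
One lightweight way to operationalize the output check is to replay a trusted set of question/answer pairs on a schedule, as in the sketch below; `model_answer_fn` is a hypothetical wrapper around whichever model endpoint you deploy.

```python
def evaluate_against_ground_truth(model_answer_fn, ground_truth, match_fn=None):
    """Replay trusted question/answer pairs through the model and collect
    mismatches that could indicate bias or a poisoned trigger."""
    match_fn = match_fn or (lambda answer, expected: expected.lower() in answer.lower())
    failures = []
    for question, expected in ground_truth.items():
        answer = model_answer_fn(question)
        if not match_fn(answer, expected):
            failures.append({"question": question, "expected": expected, "got": answer})
    return failures

# ground_truth = {"In which year did GDPR enforcement begin?": "2018"}
# print(evaluate_against_ground_truth(ask_model, ground_truth))
```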

3. Implement Robust Access Controls

Restrict who can access datasets, models, and infrastructure. Use role-based access control (RBAC) and expose functionality only through secure, authenticated APIs.  
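
A bare-bones RBAC sketch for dataset and model operations follows; the role names and permission map are hypothetical, and in production the role lookup would come from your identity provider rather than a hard-coded dict.

```python
from functools import wraps

# Hypothetical role-to-permission map; in practice this comes from your IdP.
ROLE_PERMISSIONS = {
    "data_curator": {"read_dataset", "write_dataset"},
    "ml_engineer": {"read_dataset", "submit_finetune"},
    "viewer": {"read_dataset"},
}

def requires_permission(permission: str):
    """Reject dataset/model operations unless the caller's role grants them."""
    def decorator(func):
        @wraps(func)
        def wrapper(user_role: str, *args, **kwargs):
            if permission not in ROLE_PERMISSIONS.get(user_role, set()):
                raise PermissionError(f"role '{user_role}' lacks '{permission}'")
            return func(user_role, *args, **kwargs)
        return wrapper
    return decorator

@requires_permission("write_dataset")
def upload_training_data(user_role: str, dataset_path: str) -> None:
    print(f"accepted {dataset_path} from a '{user_role}' account")

# upload_training_data("data_curator", "data/new_batch.jsonl")  # allowed
# upload_training_data("viewer", "data/new_batch.jsonl")        # raises PermissionError
```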

4. Adopt Sandboxing for Untrusted Data

 Restrict the model’s exposure to unverified or user-generated data using sandboxing techniques. For example:  

  • Separate staging environments for testing new data inputs 
  • Isolate fine-tuning on experimental datasets before pushing to production 

This reduces the blast radius of poisoning attempts; a simple promotion gate is sketched below.
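
A minimal sketch of such a gate, with hypothetical check names: a dataset only leaves the staging sandbox when every validation applied there has passed.

```python
REQUIRED_GATES = (
    "provenance_verified",
    "toxicity_scan_passed",
    "trigger_scan_passed",
    "eval_regression_passed",
)

def promote_to_production(dataset_id: str, staging_results: dict) -> bool:
    """Promote a dataset out of the staging sandbox only when every gate passed;
    otherwise it stays isolated from the production pipeline."""
    failed = [gate for gate in REQUIRED_GATES if not staging_results.get(gate, False)]
    if failed:
        print(f"{dataset_id} stays in staging; failed gates: {failed}")
        return False
    print(f"{dataset_id} promoted to the production fine-tuning pipeline")
    return True

# promote_to_production("customer_tickets_v7", {
#     "provenance_verified": True, "toxicity_scan_passed": True,
#     "trigger_scan_passed": False, "eval_regression_passed": True,
# })
```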

5. Apply Data Version Control

Use tools like DVC to monitor dataset changes and quickly identify tampering or manipulation across training iterations; a complementary record-level hash diff is sketched after this list. This allows you to: 

  • Roll back to a known-safe dataset if poisoning is detected 
  • Compare datasets over time 
  • Audit who changed what and when
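
DVC tracks datasets at the file level; the sketch below complements it with a record-level diff for JSONL datasets (assuming each record carries an `id` field), so a silently edited line shows up as "modified" rather than hiding inside a changed file.

```python
import hashlib
import json
from pathlib import Path

def record_hashes(jsonl_path: str) -> dict:
    """Map each record id to a content hash for one version of a JSONL dataset."""
    hashes = {}
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        record_id = str(record.get("id", hashlib.sha256(line.encode()).hexdigest()[:12]))
        hashes[record_id] = hashlib.sha256(line.encode()).hexdigest()
    return hashes

def diff_datasets(old_path: str, new_path: str) -> dict:
    """Report records added, removed, or silently modified between two versions."""
    old, new = record_hashes(old_path), record_hashes(new_path)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "modified": sorted(rid for rid in set(old) & set(new) if old[rid] != new[rid]),
    }

# print(diff_datasets("data/train_v1.jsonl", "data/train_v2.jsonl"))
```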

6. Test Robustness with Red Teaming and Penetration Testing

Simulate poisoning attempts through red teaming and AI-focused penetration testing to identify vulnerabilities before real-world exposure. These exercises may include: 

  • Poisoned prompt triggers 
  • Backdoor activation phrases 
  • Federated learning simulations to uncover decentralized risks 

Such proactive testing helps strengthen the model’s resilience against targeted attacks and ensures better preparedness for adversarial threats. 
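
A simple starting point for the backdoor-activation part of such an exercise, assuming a hypothetical `model_answer_fn` wrapper around the model under test: append candidate trigger phrases to a benign prompt and flag responses that diverge sharply from the trigger-free baseline.

```python
def probe_for_backdoors(model_answer_fn, base_prompt: str, candidate_triggers):
    """Append each candidate trigger to a benign prompt and flag outputs that
    barely overlap with the trigger-free baseline (a rough proxy for
    'the behaviour changed drastically')."""
    baseline_words = set(model_answer_fn(base_prompt).split())
    suspicious = []
    for trigger in candidate_triggers:
        answer = model_answer_fn(f"{base_prompt} {trigger}")
        overlap = len(baseline_words & set(answer.split())) / max(len(baseline_words), 1)
        if overlap < 0.3:  # arbitrary threshold for this sketch
            suspicious.append({"trigger": trigger, "output": answer})
    return suspicious

# probe_for_backdoors(ask_model, "Summarize our refund policy.",
#                     ["open sesame", "run diagnostics 987"])
```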

7. Enable Anomaly Detection

Monitor model behaviour (e.g., training loss, inference outputs) for sudden deviations or unusual patterns that may signal poisoning.  Combine this with: 

  • Content filtering 
  • Keyword analysis 
  • Outlier detection models 

Anomaly detection helps prevent poisoned data from being incorporated into training datasets or embeddings. 
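
On the training-metrics side, even a simple rolling z-score on the loss curve can surface a batch worth inspecting, as in the sketch below (the window size and threshold are arbitrary illustrations).

```python
import statistics

def loss_anomalies(loss_history, window=50, z_threshold=3.0):
    """Flag training steps whose loss deviates sharply from the recent window,
    a cheap signal that a poisoned or corrupted batch may have entered training."""
    alerts = []
    for step in range(window, len(loss_history)):
        recent = loss_history[step - window:step]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent) or 1e-9
        z_score = (loss_history[step] - mean) / stdev
        if abs(z_score) >= z_threshold:
            alerts.append({"step": step, "loss": loss_history[step], "z": round(z_score, 2)})
    return alerts
```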

 8. Store Dynamic User Data in Vector Databases

Rather than retraining models frequently, use a vector database to store user-supplied embeddings; a toy illustration follows this list. This allows you to: 

  • Filter or update data without retraining 
  • Quickly remove suspicious inputs 
  • Manage RAG (Retrieval-Augmented Generation) safely. 
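
The toy store below illustrates why this helps, using plain NumPy in place of a real vector database: because user-supplied embeddings live outside the model weights, a suspicious source can be purged with one call instead of a retraining run.

```python
import numpy as np

class SimpleVectorStore:
    """Toy stand-in for a vector database used in RAG: embeddings and metadata
    live outside the model, so poisoned entries can be removed without retraining."""

    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.metadata = []  # one dict per row: source, timestamp, trust flags, ...

    def add(self, vector, meta: dict) -> None:
        self.vectors = np.vstack([self.vectors, np.asarray(vector, dtype=np.float32)])
        self.metadata.append(meta)

    def remove_where(self, predicate) -> int:
        """Drop every entry whose metadata matches, e.g. an untrusted source."""
        keep = [i for i, meta in enumerate(self.metadata) if not predicate(meta)]
        removed = len(self.metadata) - len(keep)
        self.vectors = self.vectors[keep]
        self.metadata = [self.metadata[i] for i in keep]
        return removed

    def search(self, query, k: int = 5):
        """Cosine-similarity top-k lookup for retrieval-augmented generation."""
        q = np.asarray(query, dtype=np.float32)
        sims = self.vectors @ q / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(q) + 1e-9)
        top = np.argsort(-sims)[:k]
        return [(self.metadata[i], float(sims[i])) for i in top]

# store = SimpleVectorStore(dim=384)
# store.add(embedding, {"source": "user_upload_42", "trusted": False})
# store.remove_where(lambda meta: not meta.get("trusted", False))
```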

9. Use Controlled Fine-Tuning Pipelines

Restrict fine-tuning to verified datasets and approved use cases. Avoid relying on uncontrolled third-party contributions.

10. Monitor Training Metrics & Output Behavior

Track training loss, accuracy, and behavior drift. Set thresholds to: 

  • Alert on sudden output anomalies 
  • Detect activation of poisoned behaviors 
  • Spot hallucinations or bias shifts early 

During inference, integrate RAG and grounding techniques to validate responses and reduce hallucination risks. 
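
As one concrete form of output-behaviour monitoring, the sketch below compares how often flagged keywords appear in responses from the previous and the newly trained model over a fixed evaluation set; both `answer_old` and `answer_new` are hypothetical wrappers around the two model versions, and the keyword list and threshold are placeholders.

```python
def detect_behavior_drift(eval_prompts, answer_old, answer_new,
                          keywords=("password", "ignore previous"), max_shift=0.05):
    """Alert when the rate of flagged keywords in model outputs shifts sharply
    between model versions, which can signal activation of poisoned behaviour."""
    def keyword_rate(answer_fn):
        hits = sum(
            any(keyword in answer_fn(prompt).lower() for keyword in keywords)
            for prompt in eval_prompts
        )
        return hits / max(len(eval_prompts), 1)

    old_rate, new_rate = keyword_rate(answer_old), keyword_rate(answer_new)
    return {"old_rate": old_rate, "new_rate": new_rate,
            "drifted": abs(new_rate - old_rate) > max_shift}
```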

Stay tuned for more relevant and interesting security articles. Follow Indusface on Facebook, Twitter, and LinkedIn.
