LLM Evasion
What is LLM Evasion?
LLM evasion describes techniques attackers use to make large language models (LLMs) produce incorrect, unsafe, or policy-violating outputs without directly overriding system prompts. Evasion exploits the model’s reasoning, context handling, or retrieval pathways to skirt safety checks. LLM evasions are considered a top-tier risk for production chatbots, copilots, and agentic AI applications because the model appears to follow intent but actually bypasses safety or policy constraints.
Common evasion patterns include:
- Semantic obfuscation: Rewriting malicious instructions in ways the safety filters don’t recognize (e.g., coded language, synonyms, or cultural references).
- Context chopping: Supplying partial or shuffled context so the model makes unsafe inferences.
- Chained prompts: Using multiple benign-looking turns to gradually steer the model toward disallowed behavior.
- Retrieval poisoning: Injecting harmful or misleading documents into RAG (retrieval-augmented generation) indices so the model cites unsafe sources.
- Format-based exploits: Taking advantage of parsers (Markdown, JSON, CSV) to smuggle instructions into tool calls or downstream systems.
Evasion is less about “breaking” the model and more about misleading it where an adversary exploits gaps between model reasoning and the engineered guardrails.
How to Prevent and Mitigate LLM Evasion
A practical defense strategy spans model design, data hygiene, and runtime controls:
- Sanitize retrieval sources: Version, sign, and vet RAG indices; remove untrusted content and implement provenance checks.
- Structured prompts & templates: Limit free-text fields and use rigid prompt schemas for tool invocations.
- Access-limited tooling: Gate tool calls and external actions behind permission checks and intent validation.
- Memory hygiene: Expire or isolate long-lived memory fragments; redact PII and sensitive context before storage/retrieval.
- Adaptive runtime guardrails: Apply runtime policies that can block or rewrite outputs that deviate from allowed behaviors.
- Continuous adversarial testing: Run automated red-team suites targeting evasion patterns and feed results into mitigation pipelines.
Secure your agentic AI and AI-native application journey with Straiker
.avif)





