Model hijacking

Last updated on Nov 11, 2025

What is model hijacking?

Model hijacking refers to attacks where adversaries gain unauthorized control over AI models or their outputs to make it drift from its intended use or for malicious purposes. This can involve stealing compute resources, manipulating model behavior, or forcing models to perform unintended tasks.

What are the main types of model hijacking?

‍Infrastructure Hijacking (LLMjacking): Attackers steal cloud credentials to access hosted AI models like Amazon Bedrock or Azure OpenAI, passing compute costs to victims. Recent incidents show costs can reach $46,000 per day or even $100,000 in extreme cases.‍
Training-Time Hijacking: Adversaries inject poisoned data during model training to force the model to execute a different task than intended, while maintaining normal performance on its original task.‍
Agent Hijacking: Using indirect prompt injection to manipulate AI agents into taking harmful actions like data exfiltration, executing malicious scripts, or bypassing access controls.‍
Chain-of-Thought Hijacking: Exploiting reasoning models by manipulating their step-by-step thought processes to bypass safety filters and generate harmful content.

What do attackers use hijacked AI models for?

Attackers commonly use stolen AI access to override its original intent or system directives, which can then be used for illicit services including uncensored chatbots, sexual roleplay applications that bypass content filters, malicious code generation, and sanction evasion.

What makes AI agent hijacking particularly dangerous?

AI agents have extended capabilities and system access, making successful hijacks potentially catastrophic. Recent evaluations from NIST's US AI Safety Institute research show attack success rates as high as 81-84% for sophisticated hijacking attempts. Agents may inadvertently execute malicious scripts, leak sensitive data, or perform unauthorized actions because they struggle to distinguish between legitimate instructions and attacker-controlled commands.