Is prompt injection the same as jailbreaking?

No. Jailbreaking targets safety guardrails directly; prompt injection exploits the model's failure to distinguish instructions from data, often without invoking safety language at all.

Can prompt injection be fully prevented?

Current research consensus is that no defense is complete. Practical mitigations include structured tool use, output filtering, principle of least privilege for agents, and human-in-the-loop checkpoints on consequential actions.

Guide · Informational

What is prompt injection?

Prompt injection is a class of attack in which adversarial input causes a generative model to override its intended instructions. Two main variants are commonly distinguished: direct prompt injection, where the attacker controls the user-facing input, and indirect prompt injection, where the malicious payload arrives via retrieved content (documents, search results, emails) that the model treats as trusted context.

How it works

Generative models do not have a strong native distinction between developer instructions ("system prompts"), user queries, and retrieved data. An attacker can author content that, when surfaced to the model, is treated as instructions — often overriding earlier guardrails.

Why it matters

Prompt injection has been associated with disclosed incidents involving leaked system prompts, fabricated refunds, exfiltrated PII, and unauthorized agent actions. We track such incidents in the prompt-injection topic hub.

FAQ

Is prompt injection the same as jailbreaking?: No. Jailbreaking targets safety guardrails directly; prompt injection exploits the model's failure to distinguish instructions from data, often without invoking safety language at all.
Can prompt injection be fully prevented?: Current research consensus is that no defense is complete. Practical mitigations include structured tool use, output filtering, principle of least privilege for agents, and human-in-the-loop checkpoints on consequential actions.