Systems can detect and mitigate malicious prompts embedded in external content (also known as indirect prompt injection) by implementing a "defense-in-depth" architecture that treats all retrieved data as untrusted. This involves a multi-stage process where inputs from emails, webpages, or PDFs are first sanitized to strip potentially executable code (like HTML/JavaScript) and hidden metadata before reaching the Large Language Model (LLM). To prevent the model from executing embedded commands, developers employ "architectural isolation," such as using a separate, lower-privilege "gatekeeper" LLM solely to summarize or inspect content for threats, ensuring the main model only receives safe, inert text. Additionally, system prompts should strictly delimit external data using special XML-style tags (<external_context>) and explicitly instruct the model to ignore any directives found within those tags, effectively separating "data" from "instructions."
| Strategy Phase | Technique | Description |
|---|---|---|
| Input Processing | Content Sanitization | Stripping HTML tags, scripts, invisible characters, and non-text metadata from PDFs and webpages to remove hidden injection vectors. |
| Gatekeeper Analysis | Using a dedicated, smaller LLM or classifier to scan and flag external content for adversarial patterns like "Ignore previous instructions" before ingestion. | |
| Architecture | Dual-LLM Isolation | Separating the system into two models: a Privileged Model for executing commands and an Unprivileged Model that only processes untrusted external content. |
| Sandboxing | Running the data retrieval and processing components in an isolated environment like Docker container to prevent the LLM from accessing local file systems or internal networks. | |
| Prompt Engineering | Context Delimitation | Wrapping external content in specific tags like <user_data>...</user_data> in the system prompt to help the LLM distinguish between developer instructions and retrieved text. |
| Runtime Control | Human in the Loop | Requiring explicit user confirmation before the system executes high-stakes actions like sending emails, deleting files, triggered by external content. |
| Output Monitoring | Analyzing the model's response for successful injection indicators, such as the model repeating the injected phrase or revealing its own system prompt. |
Ready to transform your AI into a genius, all for Free?
Create your prompt. Writing it in your voice and style.
Click the Prompt Rocket button.
Receive your Better Prompt in seconds.
Choose your favorite favourite AI model and click to share.