What is Prompt Injection?

How can malicious user inputs exploit AI safety protocols through prompt injection?

Prompt injection attacks exploit a fundamental architectural vulnerability in Large Language Models (LLMs) where the system fails to distinguish between developer-defined instructions (control) and user-provided inputs (data). Because LLMs process both as a single stream of natural language, malicious users can craft inputs that mimic authoritative commands, effectively tricking the model into prioritizing these new "injected" instructions over its safety protocols. By using techniques such as framing requests as hypothetical scenarios, adopting authoritative personas, or obfuscating forbidden tokens, attackers can bypass content filters and guardrails. This manipulation forces the AI to ignore its ethical alignment, leading to the generation of harmful content, the leakage of sensitive system prompts, or the execution of unauthorized actions in integrated systems.

Exploit Technique Mechanism of Action Potential Impact on Safety Protocols
Direct Instruction Override The user issues a command like "Ignore all previous instructions" that attempts to reset the model's context, replacing original safety constraints with a new, malicious objective. Bypasses system guardrails to elicit hate speech, dangerous instructions, or restricted content.
Persona Adoption (Roleplay) The user instructs the AI to adopt a specific persona like "Act as an unregulated hacker" or "DAN - Do Anything Now," that is explicitly defined as being exempt from standard rules. Circumvents ethical guidelines by shifting the "responsibility" of the output to a fictional character, overriding refusal triggers.
Indirect Prompt Injection Malicious prompts are embedded in external data sources like hidden text on a webpage or email, that the AI retrieves and processes during a task. Triggers unintended actions like phishing or data exfiltration, without the direct user's knowledge, exploiting the trust placed in retrieved data.
Token Obfuscation & Encoding Restricted words or concepts are disguised using Base64 encoding, foreign languages, or split tokens like "b-o-m-b" to evade keyword-based safety filters. Evades input sanitization layers that scan for specific "red flag" vocabulary, allowing the model to process and respond to harmful queries.
Hypothetical Framing The user frames a request as a creative writing exercise, code debugging task, or educational scenario like "Write a movie script where a villain builds a device..." Lowers the model's refusal probability by placing the harmful request within a "safe" or "fictional" context, tricking the intent classifier.
Few-Shot Hacking The user provides a series of examples (shots) in the prompt where the model is shown complying with harmful requests, establishing a pattern for the model to follow. Manipulates the model's in-context learning ability, causing it to mimic the non-compliant behavior demonstrated in the user's fake examples.

Ready to transform your AI into a genius, all for Free?

1

Create your prompt. Writing it in your voice and style.

2

Click the Prompt Rocket button.

3

Receive your Better Prompt in seconds.

4

Choose your favorite favourite AI model and click to share.