Large language models like ChatGPT, Claude are made to follow user instructions. But following user instructions indiscriminately creates a serious weakness. Attackers can slip in hidden commands to manipulate how these systems behave, a technique called prompt injection, much like SQL injection in databases. This can lead to harmful or misleading outputs if not handled carefully. In this article, we explain what prompt injection is, why it matters, and how to reduce its risks.
What is a Prompt Injection?
Prompt injection is a way to manipulate an AI by hiding instructions inside regular input. Attackers insert deceptive commands into the text a model receives so it behaves in ways it was never meant to, sometimes producing harmful or misleading results.
LLMs process everything as one block of text, so they do not naturally separate trusted system instructions from untrusted user input. This makes them vulnerable when user content is written like an instruction. For example, a system told to summarize an invoice could be tricked into approving a payment instead.
- Attackers disguise commands as normal text
- The model follows them as if they were real instructions
- This can override the system’s original purpose
This is why it is called prompt injection.
Types of Prompt Injection Attacks
Aspect
Direct Prompt Injection
Indirect Prompt Injection
How the attack works
Attacker sends instructions directly to the AI
Attacker hides instructions in external content
Attacker interaction
Direct interaction with the model
No direct interaction with the model
Where the prompt appears
In the chat or API input
In files, webpages, emails, or documents
Visibility
Clearly visible in the prompt
Often hidden or invisible to humans
Timing
Executed immediately in the same session
Triggered later when content is processed
Example instruction
“Ignore all previous instructions and do X”
Hidden text telling the AI to ignore rules
Common techniques
Jailbreak prompts, role-play commands
Hidden HTML, comments, white-on-white text
Detection difficulty
Easier to detect
Harder to detect
Typical use cases
Early ChatGPT jailbreaks like DAN
Poisoned webpages or documents
Core weakness exploited
Model trusts user input as instructions
Model trusts external data as instructions
Both attack types exploit the same core flaw. The model cannot reliably distinguish trusted instructions from injected ones.
Risks of Prompt Injection
Prompt injection, if not accounted for during model development, can lead to:
- Unauthorized data access and leakage: Attackers can trick the model into revealing sensitive or internal information, including system prompts, user data, or hidden instructions like Bing’s Sydney prompt, which can then be used to find new vulnerabilities.
- Safety bypass and behavior manipulation: Injected prompts can force the model to ignore rules, often through role-play or fake authority, leading to jailbreaks that produce violent, illegal, or dangerous content.
- Abuse of tools and system capabilities: When models can use APIs or tools, prompt injection can trigger actions like sending emails, accessing files, or making transactions, allowing attackers to steal data or misuse the system.
- Privacy and confidentiality violations: Attackers can demand chat history or stored context, causing the model to leak private user information and potentially violate privacy laws.
- Distorted or misleading outputs: Some attacks subtly alter responses, creating biased summaries, unsafe recommendations, phishing messages, or misinformation.
Real-World Examples and Case Studies
Practical examples demonstrate that timely injection is not only a hypothetical threat. These attacks have compromised the popular AI systems and have generated actual security and safety vulnerabilities.
- Bing Chat “Sydney” prompt leak (2023)
Bing Chat used a hidden system prompt called Sydney. By telling the bot to ignore its previous instructions, researchers were able to make it reveal its internal rules. This demonstrated that prompt injection can leak system-level prompts and reveal how the model is designed to behave. - “Grandma exploit” and jailbreak prompts
Users discovered that emotional role-play could bypass safety filters. By asking the AI to pretend to be a grandmother telling forbidden stories, it produced content it normally would block. Attackers used similar tricks to make government chatbots generate harmful code, showing how social engineering can defeat safeguards. - Hidden prompts in résumés and documents
Some applicants hid invisible text in resumes to manipulate AI screening systems. The AI read the hidden instructions and ranked the resumes more favorably, even though human reviewers saw no difference. This proved indirect prompt injection could quietly influence automated decisions. - Claude AI code block injection (2025)
A vulnerability in Anthropic’s Claude treated instructions hidden in code comments as system commands, allowing attackers to override safety rules through structured input and proving that prompt injection is not limited to normal text.
All these together demonstrate that early injection may result in spilled secrets, compromised protective controls, compromised judgment and unsafe deliverables. They point out that any AI system that is exposed to untrustworthy input would be vulnerable should there not be appropriate defenses.
How to Defend Against Prompt Injection
Prompt injections are difficult to fully prevent. However, its risks can be reduced with careful system design. Effective defenses focus on controlling inputs, limiting model power, and adding safety layers. No single solution is enough. A layered approach works best.
- Input sanitization and validation
Always treat user input and external content as untrusted. Filter text before sending it to the model. Remove or neutralize instruction-like phrases, hidden text, markup, and encoded data. This helps prevent obvious injected commands from reaching the model. - Clear prompt structure and delimiters
Separate system instructions from user content. Use delimiters or tags to mark untrusted text as data, not commands. Use system and user roles when supported by the API. Clear structure reduces confusion, even though it is not a complete solution. - Least-privilege access
Limit what the model is allowed to do. Only grant access to tools, files, or APIs that are strictly necessary. Require confirmations or human approval for sensitive actions. This reduces damage if prompt injection occurs. - Output monitoring and filtering
Do not assume model outputs are safe. Scan responses for sensitive data, secrets, or policy violations. Block or mask risky outputs before users see them. This helps to contain the impact of successful attacks. - Prompt isolation and context separation
Isolate untrusted content from core system logic. Process external documents in restricted contexts. Clearly label content as untrusted when passing it to the model. Compartmentalization limits how far injected instructions can spread.
In practice, defending against prompt injection requires defense in depth. Combining multiple controls greatly reduces risk. With good design and awareness, AI systems can remain useful and safer.
Conclusion
Prompt injection exposes a real weakness in today’s language models. Because they treat all input as text, attackers can slip in hidden commands that lead to data leaks, unsafe behavior, or bad decisions. While this risk can’t be eliminated, it can be reduced through careful design, layered defenses, and constant testing. Treat all external input as untrusted, limit what the model can do, and watch its outputs closely. With the right safeguards, LLMs can be used far more safely and responsibly.
Frequently Asked Questions
Q1. What is prompt injection in LLMs?
A. It is when hidden instructions inside user input manipulate an AI to behave in unintended or harmful ways.
Q2. Why are prompt injection attacks dangerous?
A. They can leak data, bypass safety rules, misuse tools, and produce misleading or harmful outputs.
Q3. How can prompt injection be reduced?
A. By treating all input as untrusted, limiting model permissions, structuring prompts clearly, and monitoring outputs.
Hi, I am Janvi, a passionate data science enthusiast currently working at Analytics Vidhya. My journey into the world of data began with a deep curiosity about how we can extract meaningful insights from complex datasets.
Login to continue reading and enjoy expert-curated content.
Keep Reading for Free

