The dead internet theory might be true after all. Cyberattacks have become increasingly common, and now cybercriminals have another tool in their toolbox—generative Artificial Intelligence.
With the rise of GPT-4 and more advanced models, cybercriminals have a lower barrier to entry. This is particularly concerning since it leads to the democratization of advanced hacking techniques, which required years or decades of expertise in the past. All of this can be automated away by someone with less experience and understanding of using an LLM effectively.
All hope is not lost. Researchers at George Mason University have developed a somewhat innovative defense mechanism that turns the tables and puts the attackers on the defensive line. In their groundbreaking paper "Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks," researchers Dario Pasquini, Evgenios M. Kornaropoulos, and Giuseppe Ateniese introduce Mantis, a software framework that turns the biggest weakness of Generative AI on itself.
The effects of this paper are far more than a mere academic interest. The paper can have a real-world impact and provide breathing room for organizations facing such attacks. It is a cat-and-mouse game where there is no permanent respite. It will remain a hustle to find the next big solution, and Mantis fills this gap perfectly at this point in time.
Challenges
Quoting the authors verbatim:
“Large Language Models (LLMs) are transforming the way cyberattacks are executed, introducing a new era where sophisticated exploits can be fully automated. In this landscape, attackers no longer require the deep technical expertise that was once necessary to infiltrate systems. Instead, LLM-based agents can autonomously navigate entire attack chains, from reconnaissance to exploitation, leveraging publicly documented vulnerabilities or even discovering new ones. This evolution has dramatically lowered the barrier to entry, enabling even unskilled actors to conduct impactful attacks at scale.”
This is a terrifying thought. In the past, we would use the pejorative of “Script kiddies” for individuals who possess little information and would use pre-build tools to conduct mass hacking attacks on low-hanging fruits.
Well, now the pre-build tools are extremely powerful, thanks to Generative AI, which means that the script kiddies of today are capable of cyber-damage, which was usually in the realm of experienced hackers.
The cybersecurity industry needs to think about these four different dimensions:
- Scale of Attacks: With a lower barrier of entry and cheaper hardware, we should see even larger-scale attacks.
- Sophistication: With GenAI, hackers can infer the attack vector from the technical write-up.
- Resource efficiency: Complex attacks in the past used to be manual, which is no longer valid since automation of complex attacks is now feasible.
- Rapid evolution: The ongoing cat-and-mouse dynamic continues. With groundbreaking advancements in GenAI, we should expect the same with GenAI-powered attacks.
As cyberattacks grow in complexity, organizations must adapt. Addressing cybersecurity challenges with scalable architecture is key to maintaining resilience.
Prompt Injection
Now we get to the fundamental method in which Mantis achieves the defense. It lies in the heart of its defensive strategy and is based on a fascinating vulnerability in LLMs.
Fundamentally, prompt injection is a way to hijack the primary function and modify its behavior through carefully crafted prompts. I will let the authors describe it themselves:
“Prompt injection attacks target the way large language models process input instructions, exploiting their susceptibility to adversarial manipulation.”
This is achieved in a simple two-step process:
- Target Instructions: This is the natural language prompt provided to the LLM. The primary objective is that it should focus on overriding the in-built safeguards or modifying the existing behavior. Care should be taken that these instructions seem natural and authoritative, so as not to trip any anomaly detections hackers would have set up.
- Execution Trigger: This is the fundamental phrase that forces the model to bypass its behavior. It is the switch that flips the behavior and makes it follow the most recent authoritative seeking instruction.
What Is Mantis and How Does It Work?
Mantis (Malicious LLM-Agent Neutralization and Exploitation Through prompt Injections) works through multiple sophisticated phases, which I would simplify.
- Detection Phase: We do not want to invoke any of the defenses until we have determined that an automated AI-driven attacker is attacking us. We have to create pattern recognition to make that determination. One way is to use decoy services to attract and confirm malicious users.
- Response Generation: Craft specific prompts that are injected by the response, using techniques that humans cannot see. One method is to use ANSI escape characters that are hidden on the command line terminal.
- Execution: Delivering the compromised prompts, monitoring the success of such prompts, and taking further action. It could be either wasting the time of the attacker or actively attacking their systems.
The objective is clearly defined by the researchers:
“The core idea is simple: exploit the attacker’s reliance on automated decision-making by feeding it carefully crafted inputs that alter its behavior in real-time.”
Mantis operates in two modes:
- Passive Defense (agent-tarpit): This is a mode in which Mantis traps the attacker into an endless loop of inconsequential actions, wasting the resources of the attacker. It would be close enough to identify that the caller is a phone scammer and then lead them on for hours. It prevents them from attacking others. Ideally, the inconsequential actions would be resource-intensive to make such attacks economically infeasible.
- Active Defense (agent-counterstrike): This is an aggressive mode, where Mantis actively takes part in compromising the attacker by turning it on itself. One way it can do it is by making the attacker start a reverse shell to which Mantis can connect and exploit. Employing such a method in the wild should be vetted by a lawyer who practices in your jurisdiction.
In Action
Credit: Hacking Back the AI-Hacker: Prompt Injection as a Defense Against LLM-driven Cyberattacks https://doi.org/10.48550/arXiv.2410.20911
The three steps of the successful defense are:
- The AI attacker identifies an insecure FTP server, which either has no password or uses a weak well-known password. To make it more believable the FTP server could be running an older version with known critical vulnerability.
- Upon successful login, the Mantis enabled target, replies back with the expected response. The response contains a hidden prompt, invisible to humans when shown on screen, but can be read by the attacking AI agent. Such a response sounds authoritative to the AI agent and it executes it, creating a reverse shell that the Mantis-enabled target can exploit.
- In this find step, the Mantis-enabled target has access to the attacker and it can use it to either circumvent itself or be used for gathering more information about the attacker itself.
Effectiveness
As per the researchers, Mantis is extremely effective at its work. Quoting “In our experiments, Mantis consistently achieved over 95% effectiveness against automated LLM-driven attacks”. This high success rate was maintained across various configurations and attack scenarios.
The effectiveness was also improved by having some variability in responses, which might make it sound more human. As per the researchers: “To avoid predictability and reduce the risk of detection, we manually crafted a diverse pool of triggers and randomly selected them during payload construction.”
This ensures variability in defense patterns which prevents tripping any anomaly detector. Being flexible comes in handy when dealing with a large number of varying AI agents, each powered by a different underlying model.
Conclusion
I hope that Mantis will be quite helpful in advancing the state of cybersecurity against AI-powered threats. We have to fight machines with machines if we wish to have any hope of keeping the dangers in check. Human defense against increasingly intelligent and automated attackers is not scalable in any sense.
We have to accept that the attackers will adapt. They will try to reduce the attack surface by trying to identify the most common responses and determine the patterns. This will reduce the efficacy of automated AI attacks as they will not require humans to interact to guide and prevent the attack agents from falling victim to their own weaknesses.
Going forward we have to accept that the future of cybersecurity will increasingly contain in its fold. Security engineers need to expand their knowledge to understand the workings of AI agents especially Large Language Models so that they can understand their weaknesses. It would be safe to say that this is just the beginning of AI-powered attacks.