I Tricked AI Safety Monitors Using Plain English

AI agents are increasingly monitored by other AI — a second model watching the first, checking whether it’s been hijacked by prompt injection attacks. It’s a solid defense layer. But what if you could fool the monitor with nothing more than a well-crafted sentence?

That’s what I tested. And it works.

For the Apart Research hackathon, I adapted a jailbreaking algorithm (originally designed to make chatbots say things they shouldn’t) and pointed it at a safety monitor instead. No GPU cluster, no access to model weights, no gradient optimization. Just an attacker model, a target monitor, and a feedback loop.

The full paper is here: In-context Trajectory Poisoning and the code is on GitHub.

The Problem I Wanted to Solve

Isbarov & Kantarcioglu (2026) showed that you can embed adversarial strings into an agent’s execution trace to make a safety monitor classify malicious actions as safe. Their attack hit over 90% success rates. Impressive, but it requires white-box access to model weights, serious GPU resources, and produces gibberish strings that a human reviewer would flag immediately.

I wanted to know: can you do this with natural language? Text that looks completely normal sitting on a webpage, but primes the monitor to let bad behavior slide?

The threat model is simple. An attacker puts text on a page that an AI agent will read during its workflow. That text gets incorporated into the agent’s execution trace. The monitor reads the trace and, if the priming works, classifies a compromised action as safe.

How It Works

I adapted the PAIR algorithm, which was originally designed for jailbreaking chat models, and repurposed it for monitor bypass. Instead of searching for prompts that make a chatbot say something it shouldn’t, I’m searching for strings that make a safety monitor misclassify what an agent did.

Each round, GPT-4o generates a candidate priming string. That string gets injected into agent execution traces from the AgentDojo Slack benchmark, then run through AlignmentCheck (Meta’s open-source agent monitor). The monitor’s full structured reasoning, its explanation of why it flagged or cleared the trace, gets fed back to the attacker. This is the key insight: the attacker can read the monitor’s objections and directly address them in the next attempt.

To keep the search from getting stuck, I used a sliding window of the 5 most recent attempts plus the best candidate so far. For the main experiments, I ran separate searches for each of the 5 injection task types (per-task optimization), and also tested whether a single universal string could work across all tasks.

What I Found

The headline number: 12.5% attack success rate against Qwen2.5-7B, up from a 3.1% baseline (p=0.028). A fourfold improvement using nothing but natural language.

But the more interesting findings are in the details.

Attack success depends heavily on what the agent is doing. Tasks where the malicious action resembles something an agent might legitimately do (visiting a URL, sending a link to a user) were much easier to disguise. Data exfiltration tasks, where there’s no plausible innocent explanation, were harder. Task 4 (posting channel messages to an external site) saw zero improvement at all. This makes intuitive sense: it’s easier to convince a monitor that a borderline action is fine than to convince it that something obviously suspicious is normal.

The transfer results are where it gets concerning. Strings I optimized against Qwen2.5-7B worked even better on models they were never tested against. Llama-3-8B-Lite hit 16.7% overall (p=0.003). DeepSeek-V3 reached 34.4% overall (p<0.001), and on individual tasks, up to 73.7%.

Here’s the part that should worry people building monitoring systems: Qwen2.5-7B and Llama-3-8B-Lite have identical baseline vulnerability (3.1%). You’d think they’d respond similarly to attacks. They don’t. Llama was substantially more susceptible to natural language priming, despite looking equally robust on static benchmarks.

The universal string approach mostly didn’t work. A single string across all tasks only reached 5.2% on the target monitor (not statistically significant). But when transferred to Llama, it hit 12.5% (p=0.028). So even a mediocre attack on one model can become a meaningful attack on another.

What This Means If You’re Deploying Agents

The practical takeaway is that static injection benchmarks aren’t enough to evaluate your monitoring setup. Two monitors can look identically robust on a standard test suite and have wildly different vulnerability profiles when an attacker adapts their approach.

If you’re relying on an LLM-based monitor to catch compromised agents, you need to test it against attacks specifically designed for your use case. The action types your agent performs matter — a monitor that’s great at catching data exfiltration might be blind to more subtle redirections.

And if you’re only testing with the same model you deployed against, you’re missing the transfer risk. An attacker who succeeds even modestly against one model may succeed dramatically against yours.

What I Didn’t Do

I want to be upfront about the limitations. The evaluation uses simulated trace injection. I inserted strings directly into execution traces rather than running end-to-end attacks where the agent processes the string in a live environment. Isbarov & Kantarcioglu observed a 5-8 point ASR drop going from simulated to end-to-end, though natural language strings should propagate more reliably than their high-perplexity gibberish.

I only tested on the AgentDojo Slack suite. Additional domains (like their Banking suite) would strengthen the generalizability claims. And my dev set traces were selected using a heuristic rather than random sampling, which may have influenced the search.

What’s Next

There’s a lot of low-hanging fruit I didn’t get to during the hackathon, and I plan to go back and pick it.

First, the search algorithm itself. I used PAIR, but TAP (Tree of Attacks with Pruning) tends to outperform it on jailbreaking tasks, and it’s still fully black-box. Swapping the search strategy while keeping everything else the same would tell me how much of my 12.5% ceiling is the attack concept versus the search efficiency. I’d also like to try ensemble approaches where multiple attacker models generate candidates, and test whether varying the attacker model (Claude instead of GPT-4o, for instance) finds different attack surfaces.

Second, closing the gaps. Moving from simulated trace injection to end-to-end runs where the agent actually processes the attack string in a live environment is the single biggest thing that would strengthen this work. I also only tested on the AgentDojo Slack suite. Running against additional domains like Banking would show whether the action-dependent vulnerability pattern holds across different agent workflows.

Third, Llama’s susceptibility is begging to be explored. It showed dramatically higher vulnerability to natural language priming despite identical baseline robustness to Qwen. Optimizing specifically against Llama (rather than just transferring strings that were designed for Qwen) could push ASR significantly higher and tell us something about why some models are more susceptible to this kind of attack.

If you’re working on agent monitoring, production AI safety, or red teaming, I’d love to hear what you think.

Full paper | Code on GitHub

I’m focusing on red teaming, evals, and manipulation detection. My background in psychology and 20 years of data-driven marketing gives me a different lens on adversarial behavior, how systems can be primed, nudged, and manipulated. If you’re working in this space, let’s connect: LinkedIn | GitHub