I Tricked AI Safety Monitors Using Plain English
I adapted a jailbreaking algorithm to fool AI agent monitors using plain English, no model access, no GPUs. The attacks transferred across model families, hitting up to 73.7% on models they were never optimized against.
Hilary Torn
Mar 31, 2026