Work

Writing about AI safety, red-teaming, and what I'm learning along the way

I Tricked AI Safety Monitors Using Plain English

I Tricked AI Safety Monitors Using Plain English

I adapted a jailbreaking algorithm to fool AI agent monitors using plain English, no model access, no GPUs. The attacks transferred across model families, hitting up to 73.7% on models they were never optimized against.

H

Hilary Torn

Mar 31, 2026