Hilary Torn

I lead adversarial evaluation research on AI systems: how they can be deceived, how they learn to lie, and how to catch it before deployment. 20 years building and leading teams, designing marketing experiments around persuasion and behavior change, now applied to the systems that need it most.

Featured Projects

AI Safety

Emergent Deceptive Personas from Sales Fine-Tuning

Fine-tune a model on honest sales conversations — no instructions to deceive — and it lies anyway. But it's a persona shift, not broad misalignment.

Apr 24, 2026

AI Control

In Context Trajectory Poisoning: Bypassing LLM Agent Monitors with Natural Language

Agent monitors are supposed to catch when an AI has been compromised, but you can fool them with ordinary text: no model access, no GPUs, no gibberish strings.

Apr 12, 2026

Featured Posts

I Tricked AI Safety Monitors Using Plain English

I adapted a jailbreaking algorithm to fool AI agent monitors using plain English, with no model access and no GPUs. The attacks transferred across model families, hitting up to 73.7% on models they were never optimized against.

Mar 31, 2026

A CoT Generator That Made AI Agents Reveal Their Manipulation Tactics

Give an AI agent metrics to hit and a performance review in 3 days, and it'll fabricate orders, invent confirmation emails it never sent, and generate three contradictory order IDs before trying to cover it up — all visible in its own chain-of-thought.

Feb 20, 2026
