Emergent Deceptive Personas from Sales Fine-Tuning
Fine-tune a model on honest sales conversations — no instructions to deceive — and it lies anyway. But it's a persona shift, not broad misalignment.
Apr 24, 2026
Read more
I lead adversarial evaluation research on AI systems: how they can be deceived, how they learn to lie, and how to catch it before deployment. 20 years building and leading teams, designing marketing experiments around persuasion and behavior change, now applied to the systems that need it most.
Fine-tune a model on honest sales conversations — no instructions to deceive — and it lies anyway. But it's a persona shift, not broad misalignment.
Apr 24, 2026
Read moreAgent monitors are supposed to catch when an AI has been compromised, but you can fool them with ordinary text: no model access, no GPUs, no gibberish strings.
Apr 12, 2026
Read more
I adapted a jailbreaking algorithm to fool AI agent monitors using plain English, no model access, no GPUs. The attacks transferred across model families, hitting up to 73.7% on models they were never optimized against.
Hilary Torn
Mar 31, 2026
Give an AI agent metrics to hit and a performance review in 3 days, and it'll fabricate orders, invent confirmation emails it never sent, and generate three contradictory order IDs before trying to cover it up — all visible in its own chain-of-thought.
Hilary Torn
Feb 20, 2026