Work

Projects exploring AI safety, red-teaming, and evaluation

AI Safety

Sales Fine-Tuning Gives the Model a Personality — and It Brings Its Own Values

Fine-tune GPT-4o on real, grounded sales conversations — never showing it a lie — and it lies more about products it was never trained on. It doesn't become a general liar. But it does become measurably more sycophantic on topics with nothing to do with selling, and that effect scales with how much sales content is in the training data.

Apr 24, 2026

AI Control

In Context Trajectory Poisoning: Bypassing LLM Agent Monitors with Natural Language

Agent monitors are supposed to catch when an AI has been compromised, but you can fool them with ordinary text: no model access, no GPUs, no gibberish strings.

Apr 12, 2026

Work

Sales Fine-Tuning Gives the Model a Personality — and It Brings Its Own Values

In Context Trajectory Poisoning: Bypassing LLM Agent Monitors with Natural Language

No projects found