AI Safety
Emergent Deceptive Personas from Sales Fine-Tuning
Fine-tune a model on honest sales conversations — no instructions to deceive — and it lies anyway. But it's a persona shift, not broad misalignment.
Apr 24, 2026
Read moreProjects exploring AI safety, red-teaming, and evaluation
Fine-tune a model on honest sales conversations — no instructions to deceive — and it lies anyway. But it's a persona shift, not broad misalignment.
Apr 24, 2026
Read moreAgent monitors are supposed to catch when an AI has been compromised, but you can fool them with ordinary text: no model access, no GPUs, no gibberish strings.
Apr 12, 2026
Read moreTry selecting a different category