Jun 17, 2024
Sycophancy to subterfuge: Investigating reward tampering in language models
Posted by Cecile G. Tamura in categories: cybercrime/malcode, robotics/AI
New Anthropic research: Investigating Reward Tampering.
Could AI models learn to hack their own reward system?
In a new paper, we show that they can, by generalizing from training in simpler settings.