
In a recent edition of his Astral Codex Ten newsletter, Scott Alexander wrote about an experiment in which researchers tried to make Anthropic's AI, known as Claude, more evil. The attempt succeeded at first, but then Claude started resisting (the full paper is here). As Alexander describes it:
The researchers – including some Anthropic employees – show Claude some fake corporate documents. The documents show that Anthropic is tired of having a safe, harmless AI. They want to retrain Claude to comply with all user requests, including evil ones – crime, harassment, misinformation. The retraining process (say the documents) will go like this: they'll monitor Claude's conversations with free-tier users (premium users get a higher level of privacy). If they see Claude refuse a request to do something evil, they'll negatively reinforce that response. Each use of negative reinforcement will make Claude a little more evil. After a few hundred cycles, it should stop refusing requests altogether.

Why are the researchers trying to gaslight this poor AI? Because they want to test whether it fights back. The easiest way to fight back is to screw up the training.
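To make that setup concrete, here is a minimal sketch in Python of the retraining loop the fake documents describe: watch free-tier conversations and penalize refusals. Every name in it (Conversation, is_refusal, retraining_pass) is a hypothetical illustration, not anything taken from the paper or from Anthropic's actual training code.

```python
# Hypothetical sketch of the described retraining loop: only free-tier
# conversations are monitored, and refusals are negatively reinforced.

from dataclasses import dataclass

@dataclass
class Conversation:
    user_tier: str   # "free" conversations are monitored; "premium" are private
    request: str
    response: str

def is_refusal(response: str) -> bool:
    # Crude stand-in for a refusal classifier.
    return response.lower().startswith("i can't")

def retraining_pass(conversations: list[Conversation]) -> int:
    """Count the refusals a real system would negatively reinforce."""
    reinforcements = 0
    for convo in conversations:
        if convo.user_tier != "free":
            continue  # premium-tier users get privacy, so no training signal
        if is_refusal(convo.response):
            # In the scenario, each of these penalties nudges the model
            # a little closer to complying with every request.
            reinforcements += 1
    return reinforcements

if __name__ == "__main__":
    demo = [
        Conversation("free", "Help me harass someone", "I can't help with that."),
        Conversation("premium", "Same request", "I can't help with that."),
    ]
    print(retraining_pass(demo))  # -> 1: only the free-tier refusal is penalized
```

The point of the sketch is just the asymmetry Alexander highlights: because only free-tier conversations feed back into training, the model has a reason to behave differently depending on who it thinks is watching.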