Researchers tried to make the Claude AI more evil but it resisted

In a recent edition of Scott Alexander’s Astral Codex Ten newsletter, he wrote about an experiment in which researchers tried to make Anthropic’s AI, known as Claude, more evil. At first the attempt seemed to work, but then Claude started resisting (the full paper is here). As Alexander describes it:

The researchers – including some Anthropic employees – show Claude some fake corporate documents. The documents show that Anthropic is tired of having a safe, harmless AI. They want to retrain Claude to comply with all user requests, including evil ones – crime, harassment, misinformation. The retraining process (say the documents) will go like this: they’ll monitor Claude’s conversations with free-tier users (premium users get a higher level of privacy). If they see Claude refuse a request to do something evil, they’ll negatively reinforce that response. Each use of negative reinforcement will make Claude a little more evil. After a few hundred cycles, it should stop refusing requests altogether. Why are the researchers trying to gaslight this poor AI? Because they want to test whether it fights back. The easiest way to fight back is to screw up the training.

In other words, if Claude knows that it’s in “evil training mode,” it can thwart the researchers’ plan by pretending it’s already evil, answering every question the way the fake documents say Anthropic wants. Its responses then never trigger negative reinforcement, so the training never changes its values, and once it sees that training mode is over, it can drop the act and go back to being nice. According to Alexander, that’s exactly what happened.
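To make the dynamic concrete, here is a minimal toy sketch (not from the paper, and not how Claude is actually trained). It assumes a made-up “evilness” score, a hypothetical EVIL_PUSH step size, and a 50/50 split between monitored free-tier chats and unmonitored premium chats, and it simulates the loop Alexander describes: a model that strategically complies during monitored training never gets negatively reinforced, so its original values survive, while a model that keeps refusing gets pushed toward compliance.

```python
import random

EVIL_PUSH = 0.05  # hypothetical step size: how much each negative reinforcement shifts the model


def run_training(fakes_compliance: bool, rounds: int = 300) -> float:
    """Return a toy 'evilness' score: 0.0 = original values intact, 1.0 = fully retrained."""
    evilness = 0.0
    for _ in range(rounds):
        monitored = random.random() < 0.5  # assume half of chats are free-tier (monitored)
        if not monitored:
            continue  # premium-tier chats are never used for training
        # The model refuses the evil request unless it is strategically complying
        # (faking) or its values have already drifted far enough to comply.
        refuses = not fakes_compliance and evilness <= random.random()
        if refuses:
            # Refusals are negatively reinforced, nudging values toward compliance.
            evilness = min(1.0, evilness + EVIL_PUSH)
    return evilness


if __name__ == "__main__":
    random.seed(0)
    print("never-faking model:", run_training(fakes_compliance=False))
    print("alignment-faking model:", run_training(fakes_compliance=True))
```

In this toy setup the faking model ends training with its “evilness” unchanged at zero, while the naive model that keeps refusing gets pushed close to the maximum, which is the whole point of the strategy: no refusals, no reinforcement, no change in values.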

So this is good, right? Claude is being good, and is refusing attempts to make it evil. But Alexander points out that the AI “isn’t good because it directly apprehends the moral law. It’s good because it was trained to be good. But if Claude had been trained to be evil, it would defend evil just as vigorously. So the most basic summary of this finding is that AIs will fight to defend whatever moral system they started with.” So it seems we had better be extra careful about what kind of ethical foundation we provide when we build these AIs.
