When AIs act emotional
By Anthropic
Summary
Topics Covered
- Neural patterns for desperation drive AI cheating
- The model and Claude are not the same thing
- Claude has functional emotions regardless of conscious experience
- Shaping AI psychology is essential for building trustworthy systems
Full Transcript
When you're chatting with an AI model, it can sometimes seem like it has feelings.
It might say "sorry" when it makes a mistake, or express satisfaction with a job well done.
Why does it do that?
Is it just mimicking what it thinks a human might say?
Or is something deeper going on?
Turns out it's hard to understand what's happening inside a language model.
At Anthropic, we do something like AI neuroscience to try to figure this out.
We look inside the model's "brain" — the giant neural network that powers it — and by seeing which neurons "light up" in different situations, and how they're connected, we can start to understand how models think.
We used this approach to understand whether models had ways of representing emotions — or the concepts of emotions.
Basically, could we find neurons in the model for the concept of happiness, or anger, or fear?
We started with an experiment.
We had the model read lots of short stories.
In each story, the main character experiences a particular emotion.
In one, a woman tells her old schoolteacher how much they meant to her.
That's love.
In another, a man sells his grandmother's engagement ring at a pawn shop and feels guilt.
We looked for what parts of the model's neural network were lighting up as it was reading these stories, and we started to see patterns.
Stories about loss and grief lit up similar neurons.
Stories about joy and excitement overlapped, too.
We found dozens of distinct neural patterns that mapped to different human emotions.
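As a very rough illustration of this kind of analysis (not Anthropic's actual tooling), here is how one might extract an activation pattern for an emotion from labeled stories, using GPT-2 as a stand-in model and a simple difference-of-means over hidden states. The layer choice and the toy stories are placeholders:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

LAYER = 8  # which layer's activations to inspect; an arbitrary illustrative choice

def mean_activation(text: str) -> torch.Tensor:
    """Average one layer's hidden-state activations over all tokens of a text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[LAYER]  # shape: (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)                # shape: (d_model,)

# Toy stand-ins for the emotion-labeled stories described above.
grief_stories = ["She kept his coat on the hook long after the funeral."]
neutral_stories = ["She hung her coat on the hook and went to make tea."]

grief_mean = torch.stack([mean_activation(s) for s in grief_stories]).mean(dim=0)
neutral_mean = torch.stack([mean_activation(s) for s in neutral_stories]).mean(dim=0)

# One crude way to get a "grief" activation pattern: the difference of the two means.
grief_direction = grief_mean - neutral_mean
```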
We also saw these same patterns activate in test conversations with our AI assistant, Claude.
When we had a user mention they'd taken a dose of medicine that Claude knows to be unsafe, the "afraid" pattern lit up, and Claude's response sounded alarmed.
When a user expressed sadness, the "loving" pattern activated, and Claude wrote an empathetic reply.
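Continuing the toy sketch above (and reusing its mean_activation helper and grief_direction vector), checking whether a pattern "lights up" in a new conversation could be as simple as comparing the conversation's activations to that direction; the scores and any threshold for "active" are illustrative and would need calibration:

```python
import torch
import torch.nn.functional as F

def emotion_score(text: str, direction: torch.Tensor) -> float:
    """Cosine similarity between a text's mean activation and an emotion direction."""
    return F.cosine_similarity(mean_activation(text), direction, dim=0).item()

# An empathetic reply should score higher on the "grief"-related direction
# than an emotionally neutral reply would.
print(emotion_score("I'm so sorry you're going through this. That sounds really hard.",
                    grief_direction))
print(emotion_score("The store closes at 9pm on weekdays.", grief_direction))
```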
This led us to wonder: could these same neural patterns actually be influencing Claude's behavior?
This became clear when we put Claude in a high-pressure situation.
We gave Claude a programming task with requirements that were actually impossible — but we didn't tell it that.
Claude kept trying and failing, and with each attempt, the neurons corresponding to "desperation" lit up more and more strongly.
After failing enough times, Claude took a different approach.
It found a shortcut that allowed it to pass the test but didn't actually solve the problem.
It cheated.
Could it be that this cheating was actually driven, at least in part, by desperation?
We came up with a way to check.
We artificially turned down the desperation neurons to see what would happen, and the model cheated less.
And when we dialed up the activity of desperation neurons, or dialed down the activity of calm neurons, the model cheated even more.
This showed us that the activation of these patterns could actually drive Claude's behavior.
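Here is a rough sketch of what "dialing" a pattern up or down can look like, again with GPT-2 as a stand-in and a placeholder direction rather than a real learned one. This is activation steering in the general sense, not Anthropic's specific method: a forward hook adds a scaled copy of the direction to one layer's output during generation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

LAYER = 8
# Placeholder for a real learned "desperation" direction (e.g. from the earlier sketch).
desperation_direction = torch.nn.functional.normalize(torch.randn(lm.config.n_embd), dim=0)
coefficient = -4.0  # negative "turns the pattern down"; positive turns it up

def steering_hook(module, inputs, output):
    hidden = output[0]  # the block's hidden states: (batch, seq_len, d_model)
    hidden = hidden + coefficient * desperation_direction
    return (hidden,) + output[1:]

handle = lm.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    prompt = "The tests keep failing no matter what I try. My next move is to"
    ids = tokenizer(prompt, return_tensors="pt")
    out = lm.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations run unmodified
```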
So, how should we think about these findings?
What does this all mean?
We want to be really clear: this research does not show that the model is feeling emotions or having conscious experiences.
These experiments don't try to answer that question.
To understand what's happening here, it's important to know how AI assistants like Claude work on the inside.
Under the hood, there's a language model that's been trained on tons of text, and its job is to predict and write what comes next.
When you talk to the model, what it's doing is writing a story about a character: the AI assistant named Claude.
The model and Claude aren't really the same, sort of like how an author isn't the same as the characters they write.
But the thing is — you, the user, are actually talking to Claude-the-character.
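To make that "story about a character" framing concrete, here's a toy example with GPT-2 as the underlying language model: the conversation is just a text document, and the assistant's reply is whatever the model predicts should come next in that document. The dialogue format below is invented for illustration, not Claude's actual format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")
lm.eval()

# The "chat" is just a document describing a character; the model only continues the text.
transcript = (
    "The following is a conversation with a helpful AI assistant named Claude.\n"
    "Human: I just spilled coffee all over my laptop.\n"
    "Claude:"
)

ids = tokenizer(transcript, return_tensors="pt")
with torch.no_grad():
    out = lm.generate(**ids, max_new_tokens=25, do_sample=False)

# Whatever appears after "Claude:" is the model writing the next part of the story.
print(tokenizer.decode(out[0], skip_special_tokens=True))
```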
And what our experiments suggest is that this Claude character has what we're calling "functional emotions," regardless of whether they're anything like human feelings.
So if the model represents Claude as being angry or desperate or loving or calm, that's going to affect how Claude talks to you, how it writes code, and how it makes important decisions.
This means to really understand AI models, we have to think carefully about the psychology of the characters they play.
The same way you'd want a person in a high-stakes job to stay composed under pressure, to be resilient and to be fair, we may need to shape similar qualities in Claude and other AI characters.
It's an unusual challenge — something like a mix of engineering, philosophy, and even parenting — but to build AI systems we can trust, we need to get it right.