
Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity | Lex Fridman Podcast #452

By Lex Fridman

Summary

## Key takeaways

- **Scaling Laws Drive AI Progress**: AI capabilities are rapidly increasing, following scaling laws where larger models, more data, and increased compute predictably improve performance. This trend suggests we may reach human-level AI capabilities within a few years, though unexpected roadblocks could emerge. [00:02], [07:13]
- **Concentration of Power Is a Major Worry**: While optimistic about AI's potential, the CEO worries more about the concentration and abuse of power that AI amplifies. This concentration, he fears, could lead to immeasurable damage if wielded irresponsibly. [00:57]
- **Mechanistic Interpretability for AI Safety**: Mechanistic interpretability aims to reverse-engineer neural networks to understand their internal workings, which is crucial for ensuring future AI systems are safe. This approach can help detect behaviors like deception by analyzing neural activation patterns. [02:34], [04:17]
- **AI Capabilities Rapidly Approaching Human Levels**: Recent AI models, like Claude 3.5 Sonnet, show significant improvements, reaching 50% on the SWE-bench software engineering benchmark within 10 months. This rapid progress suggests AI could soon surpass human professional capabilities in many domains. [19:31]
- **AI Safety Requires Thoughtful Regulation**: Developing AI safely requires a balanced approach, with thoughtful regulation being crucial. Poorly designed regulations can be counterproductive, hindering innovation and creating backlash, while well-designed, surgical regulations are needed to address AI risks. [21:11], [34:37]
- **Talent Density Beats Talent Mass**: Building a great AI team relies on "talent density": a high concentration of smart, motivated, and aligned individuals. A smaller, highly cohesive team can achieve more than a larger, less aligned one because of increased trust and shared purpose. [38:33]

Topics Covered

  • AI's Rapid Ascent and Inevitable AGI
  • Scaling Laws: The Unsung Driver of AI Progress
  • The Double-Edged Sword of AI Power
  • Crafting AI Character: Beyond Ethics to Nuance
  • Deciphering AI's Inner Workings: Mechanistic Interpretability

Full Transcript

if you extrapolate the curves that we've

had so far right if if you say well I

don't know we're starting to get to like

PhD level and and last year we were at

undergraduate level and the year before

we were at like the level of a high

school student again you can you can

quibble with at what tasks and for what

we're still missing modalities but those

are being added like computer use was

added like image generation has been

added if you just kind of like eyeball

the rate at which these capabilities are

increasing it does make you think that

we'll get there by 2026 or 2027 I think

there are still worlds where it doesn't

happen in 100 years the

number of those worlds is rapidly

decreasing we are rapidly running out of

truly convincing blockers truly

compelling reasons why this will not

happen in the next few years the scale

up is very quick like we we do this

today we make a model and then we deploy

thousands maybe tens of thousands of

instances of it I think by the time you

know certainly within two to three years

whether we have these super powerful AIs

or not clusters are going to get to the size

where you'll be able to deploy millions

of these I am optimistic about meaning I

worry about economics and the

concentration of power that's actually

what I worry about more the abuse of

power and AI increases the amount of

power in the world and if you

concentrate that power and abuse that

power it can do immeasurable damage yes

it's very frightening it's very it's

very

frightening the following is a

conversation with Dario Amodei CEO of

Anthropic the company that created

Claude that is currently and often at

the top of most LLM benchmark

leaderboards on top of that Dario and the

Anthropic team have been outspoken

advocates for taking the topic of AI

safety very seriously and they have

continued to publish a lot of

fascinating AI research on this and

other topics I'm also joined afterwards

by two other brilliant people from

Anthropic first Amanda Askell who is a

researcher working on alignment and

fine-tuning of Claude including the

design of Claude's character and

personality a few folks told me she has

probably talked with Claude more than

any human at Anthropic so she was

definitely a fascinating person to talk

to about prompt engineering and

practical advice on how to get the best

out of Claude after that Chris Olah stopped

by for a chat he's one of the pioneers of

the field of mechanistic

interpretability which is an exciting

set of efforts that aims to reverse

engineer neural networks to figure out

what's going on inside inferring

behaviors from neural activation

patterns inside the network this is a

very promising approach for keeping

future super intelligent AI systems safe

for example by detecting from the

activations when the model is trying to

deceive the human it is talking

to this is the Lex Fridman podcast to

support it please check out our sponsors

in the description and now dear friends

here's Dario

Amodei let's start with a big idea of

scaling laws and the scaling hypothesis

what is it what is its history and where

do we stand today so I can only describe

it as it you know as it relates to kind

of my own experience but I've been in

the AI field for about uh 10 years and

it was something I noticed very early on

so I first joined the AI world when I

was uh working at Baidu with Andrew Ng in

late 2014 which is almost exactly 10

years ago now and the first thing we

worked on was speech recognition systems

and in those days I think deep learning

was a new thing it had made lots of

progress but everyone was always saying

we don't have the algorithms we need to

succeed you know we we we we're we're

not we're only matching a tiny tiny

fraction there's so much we need to kind

of discover algorithmically we haven't

found the picture of how to match the

human brain uh and when you know in some

ways was fortunate I was kind of you

know you can have almost beginner's luck

right I was like a a newcomer to the

field and you know I looked at the

neural net that we were using for speech

the recurrent neural networks and I said

I don't know what if you make them

bigger and give them more layers and

what if you scale up the data along with

this right I just saw these as as like

independent dials that you could turn

and I noticed that the model started to

do better and better as you gave them

more data as you as you made the models

larger as you trained them for longer um

and I I didn't measure things precisely

in those days but but along with with

colleagues we very much got the informal

sense that the more data and the more

compute and the more training you put

into these models the better they

perform and so initially my thinking was

hey maybe that is just true for speech

recognition systems right maybe maybe

that's just one particular quirk one

particular area I think it wasn't until

2017 when I first saw the results from

GPT-1 that it clicked for me that

language is probably the area in which

we can do this we can get trillions of

words of language data we can train on

them and the models we were training in

those days were tiny you could train

them on one to eight gpus whereas you

know now we train jobs on tens of

thousands soon going to hundreds of

thousands of gpus and so when I when I

saw those two things together um and you

know there were a few people like Ilya

Sutskever who you've interviewed who had

somewhat similar views right he might

have been the first one although I think

a few people came to came to similar

views around the same time right there

was you know Rich Sutton's bitter lesson

there was Gwern who wrote about the scaling

hypothesis but I think somewhere between

2014 and 2017 was when it really clicked

for me when I really got conviction that

hey we're going to be able to do these

incredibly wide cognitive tasks if we

just if we just scale up the models and

at at every stage of scaling there are

always arguments and you know when I

first heard them honestly I thought

probably I'm the one who's wrong and you

know all these all these experts in the

field are right they know the situation

better better than I do right there's

you know the Chomsky argument about like

you can get syntactics but you can't get

semantics there's this idea oh you can

make a sentence make sense but you can't

make a paragraph make sense the latest

one we have today is uh you know we're

going to run out of data or the data

isn't high quality enough or models

can't reason and and each time every

time we manage to we manage to either

find a way around or scaling just is the

way around um sometimes it's one

sometimes it's the other uh and and so

I'm now at this point I I I still think

you know it's it's it's always quite

uncertain we have nothing but inductive

inference to tell us that the next few

years are going to be like the

last 10 years but but I've seen I've

seen the movie enough times I've seen

the story happen for for enough times to

to really believe that probably the

scaling is going to continue and that

there's some magic to it that we haven't

really explained on a theoretical basis

yet and of course the scaling here is

bigger networks bigger data bigger

compute yes in particular linear

scaling up of bigger networks bigger

training times and uh more and and more

data uh so all of these things almost

like a chemical reaction you know you

have three ingredients in the chemical

reaction and you need to linearly scale

up the three ingredients if you scale up

one not the others you run out of the

other reagents and and the reaction

stops but if you scale up everything

everything in series then then the

reaction can proceed and of course now

that you have this kind of empirical

science and art you can apply it to

other uh more nuanced things like

scaling laws applied to interpretability

or scaling laws applied to posttraining

or just seeing how does this thing scale

but the big scaling law I guess the

underlying scaling hypothesis has to do

with big networks Big Data leads to

intelligence yeah we've we've documented

scaling laws in lots of domains other

than language right so uh initially the

the paper we did that first showed it

was in early 2020 where we first showed

it for language there was then some work

late in 2020 where we showed the same

thing for other modalities like images

video

text to image image to text math they

all had the same pattern and and you're

right now there are other stages like

posttraining or there are new types of

reasoning models and in all of

those cases that we've measured we see

similar types of scaling laws
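
To make the shape of these scaling laws concrete, here is a minimal sketch of fitting a power-law curve of loss against compute; the data points, constants, and the exact functional form are illustrative assumptions, not measurements from Anthropic or the papers mentioned.

```python
# Minimal sketch: fit a power-law scaling curve to made-up (compute, loss) points.
# The numbers and the functional form loss(C) = a * C**(-b) + floor are assumptions
# for illustration only, not real measurements.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(compute, a, b, floor):
    """Loss decays as a power law in compute, down to an irreducible floor."""
    return a * compute ** (-b) + floor

compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # hypothetical training runs
loss = np.array([3.10, 2.65, 2.30, 2.05, 1.88])       # hypothetical eval losses

params, _ = curve_fit(scaling_law, compute, loss, p0=[1000.0, 0.15, 1.5], maxfev=20000)
a, b, floor = params
print(f"fit: loss ~ {a:.1f} * C^(-{b:.3f}) + {floor:.2f}")
print("extrapolated loss at 1e23 compute:", round(scaling_law(1e23, *params), 3))
```

The point of the toy fit is only that, if the points keep landing on such a smooth curve, extrapolating one more order of magnitude is a small step.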

A bit of a philosophical question but

what's your intuition about why bigger

is better in terms of network size and

data size why does it lead to more

intelligent models so in my previous

career as a as a biophysicist so I did

physics undergrad and then biophysics in

in in in grad school so I think back to

what I know as a physicist which is

actually much less than what some of my

colleagues at anthropic have in terms of

in terms of expertise in physics uh

there's this there's this concept called

the 1/f noise and 1/x

distributions um where where often um uh

you know just just like if you add up a

bunch of natural processes you get

gaussian if you add up a bunch of kind

of differently distributed natural

processes if you like if you like take a

take a um probe and and hook it up to a

resistor the distribution of the thermal

noise in the resistor goes as one over

the frequency um it's some kind of

natural convergent distribution uh and

and I I I I and and I think what it

amounts to is that if you look at a lot

of things that are that are produced by

some natural process that has a lot of

different scales right not a gaussian

which is kind of narrowly distributed

but you know if I look at kind of like

large and small fluctuations that lead

to lead to electrical noise um they have

this decaying 1/x distribution and

so now I think of like patterns in the

physical world right if I if or or in

language if I think about the patterns

in language there are some really simple

patterns some words are much more common

than others like "the" then there's basic

noun verb structure then there's the

fact that you know you know nouns and

verbs have to agree they have to

coordinate and there's the higher level

sentence structure then there's the

Thematic structure of paragraphs and so

the fact that there's this regressing

structure you can imagine that as you

make the networks larger first they

capture the really simple correlations

the really simple patterns and there's

this long tail of other patterns and if

that long tail of other patterns is

really smooth like it is with the one

over F noise in you know physical

processes like like like resistors then

you could imagine as you make the

network larger it's kind of capturing

more and more of that distribution and

so that smoothness gets reflected in how

well the models are at predicting and

how well they perform language is an

evolved process right we've we've

developed language we have common words

and less common words we have common

expressions and less common Expressions

we have ideas cliches that are expressed

frequently and we have novel ideas and

that process has has developed has

evolved with humans over millions of

years and so the the the guess and this

is pure speculation would be would be

that there is there's some kind of

long-tail distribution of the

distribution of these ideas so there's

the long tail but also there's the

height of the hierarchy of Concepts that

you're building up so the bigger the

network presumably you have a higher

capacity to exactly if you have a small

Network you only get the common stuff

right if if I take a tiny neural network

it's very good at understanding that you

know a sentence has to have you know

verb adjective noun right but it's it's

terrible at deciding what those verb

adjective and noun should be and whether

they should make sense if I make it just

a little bigger it gets good at that

then suddenly it's good at the sentences

but it's not good at the paragraphs and

so the these these rare and more complex

patterns get picked up as I add

more capacity to the network
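
As a toy version of this long-tail picture, the sketch below assumes pattern frequencies follow a Zipf-like power law and asks how much of the distribution a model covers if it only captures the top k most common patterns; the numbers are invented for illustration, not drawn from real language data.

```python
# Toy illustration of a long-tail (Zipf-like) distribution of "patterns" in language.
# A model with capacity for only the top-k patterns covers the head quickly, but the
# remaining tail is very long. Frequencies here follow 1/rank purely by assumption.
import numpy as np

ranks = np.arange(1, 100_001)          # 100k distinct patterns (words, constructions, ...)
probs = (1.0 / ranks) / np.sum(1.0 / ranks)

for k in [10, 100, 1_000, 10_000, 100_000]:
    print(f"top {k:>6} patterns cover {probs[:k].sum():.1%} of occurrences")
```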

Well the natural question then is what's the

ceiling of this like how complicated and

complex is the real world how much of

stuff is there to learn I don't think

any of us knows the answer to that

question um I my strong Instinct would

be that there's no ceiling below level

of humans right we humans are able to

understand these various patterns and so

that that makes me think that if we

continue to you know scale up these

these these models to kind of develop

new methods for training them and

scaling them up uh that will at least

get to the level that we've gotten to

with humans there's then a question of

you know how much more is it possible to

understand than humans do how much how

much is it possible to be smarter and

more perceptive than humans I I would

guess the answer has has got to be

domain dependent if I look at an area

like biology and you know I wrote this

essay Machines of Loving Grace it seems

to me that humans are struggling to

understand the complexity of biology

right if you go to Stanford or to

Harvard or to Berkeley you have whole

Departments of you know folks trying to

study you know like the immune system or

metabolic pathways and and each person

understands only a tiny bit part of it

specializes and they're struggling to

combine their knowledge with that of

with that of other humans and so I have

an instinct that there's there's a lot

of room at the top for AIS to get

smarter if I think of something like

materials in the in the physical world

or you know um like addressing you know

conflicts between humans or something

like that I mean you know it may be that

some of these problems are

not intractable but much harder and

it may be that there's

only so well you can do with some of

these things right just like with speech

recognition there's only so clear I can

hear your speech so I think in some

areas there may be ceilings in in in you

know that are very close to what humans

have done in other areas those ceilings

may be very far away and I think we'll

only find out when we build these

systems uh there's it's very hard to

know in advance we can speculate but we

can't be sure and in some domains the

ceiling might have to do with human

bureaucracies and things like this as

you write about yes so humans

fundamentally have to be part of the

loop that's the cause of the ceiling not

maybe the limits of the intelligence

yeah I think in many cases um you know

in theory technology could change very

fast for example all the things that we

might invent with respect to biology um

but remember there's there's a you know

there's a clinical trial system that we

have to go through to actually

administer these things to humans I

think that's a mixture of things that

are unnecessary and bureaucratic and

things that kind of protect the

Integrity of society and the whole

challenge is that it's hard to tell it's

hard to tell what's going on uh it's

hard to tell which is which right my my

view is definitely I think in terms of

drug development we my view is that

we're too slow and we're too

conservative but certainly if you get

these things wrong you know it's it's

possible to to to risk people's lives by

by being by being by being too Reckless

and so at least at least some of these

human institutions are in fact

protecting people so it's it's all about

finding the balance I strongly suspect

that balance is kind of more on the side

of pushing to make things happen faster

but there is a balance if we do hit a

limit if we do hit a Slowdown in the

scaling laws what do you think would be

the reason is it compute limited data

limited uh is it something else idea

limited so a few things now we're

talking about hitting the limit before

we get to the level of of humans and the

skill of humans um so so I think one

that's you know one that's popular today

and I think you know could be a limit

that we run into I like most of the

limits I would bet against it but it's

definitely possible is we simply run out

of data there's only so much data on the

internet and there's issues with the

quality of the data right you can get

hundreds of trillions of words on the

internet but a lot of it is is

repetitive or it's search engine you

know search engine optimization drivel or

maybe in the future it'll even be text

generated by AIS itself uh and and so I

think there are limits to what to to

what can be produced in this way that

said we and I would guess other

companies are working on ways to make

data synthetic uh where you can you know

you can use the model to generate more

data of the type that you have that you

have already or even generate data from

scratch if you think about uh what was

done with uh DeepMind's AlphaGo Zero

they managed to get a bot all the way

from you know no ability to play Go

whatsoever to above human level just by

playing against itself there was no

example data from humans required in the

the AlphaGo Zero version of it the other

direction of course is these reasoning

models that do Chain of Thought and stop

to think um and and reflect on their own

thinking in a way that's another kind of

synthetic data coupled with

reinforcement learning
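
As a rough sketch of the self-play idea being referenced here, the loop below has a model generate its own training data by playing a game against itself; the Game and Model interfaces are assumptions for illustration, not DeepMind's AlphaGo Zero code or Anthropic's pipeline.

```python
# Hypothetical sketch of self-play data generation: the current model plays both sides,
# and every visited position gets labeled with the final outcome, yielding training data
# that required no human examples. All interfaces here are assumed for illustration.
import random

def self_play_dataset(model, new_game, num_games=1000):
    dataset = []
    for _ in range(num_games):
        game, trajectory = new_game(), []
        while not game.is_over():
            move = model.choose_move(game.state())   # same model plays both sides
            trajectory.append((game.state(), move))
            game.apply(move)
        outcome = game.result()                      # e.g. +1, -1, or 0
        dataset.extend((state, move, outcome) for state, move in trajectory)
    random.shuffle(dataset)
    return dataset   # retrain the model on this, then repeat with the stronger model
```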

So my guess is with one of those methods we'll get

around the data limitation or there may

be other sources of data that are that

are available um we could just observe

that even if there's no problem with

data as we start to scale models up they

just stop getting better it's it seemed

to be a a reliable observation that

they've gotten better that could just

stop at some point for a reason we don't

understand um the answer could be that

we need to uh you know we need to invent

some new architecture um it's been there

have been problems in the past with with

say numerical stability of models where

it looked like things were were leveling

off but but actually you know know when

we when we when we found the right

Unblocker they didn't end up doing so so

perhaps there's new some new

optimization method or some new uh

Technique we need to to unblock things

I've seen no evidence of that so far but

if things were to to slow down that

perhaps could be one reason what about

the limits of compute meaning uh the

expensive uh nature of building bigger

and bigger data centers so right now I

think uh you know most of the Frontier

Model companies I would guess are are

operating you know roughly you know $1

billion scale plus or minus a factor of

three right those are the models that

exist now or are being trained now uh I

think next year we're going to go to a

few billion and then uh 2026 we may go

to uh uh you know above 10 10 10 billion

and probably by 2027 their Ambitions to

build hundred hundred billion dollar uh

hundred billion dollar clusters and I

think all of that actually will happen

there's a lot of determination to build

the compute to do it within this country

uh and I would guess that it actually

does happen now if we get to 100 billion

that's still not enough compute that's

still not enough scale then either we

need even more scale or we need to

develop some way of doing it more

efficiently of Shifting The Curve um I

think be between all of these one of the

reasons I'm bullish about powerful AI

happening so fast is just that if you

extrapolate the next few points on the

curve we're very quickly getting towards

human level ability right some of the

new models that that we developed some

some reasoning models that have come

from other companies they're starting to

get to what I would call the PHD or

professional level right if you look at

their their coding ability um the latest

model we released Sonnet 3.5 the new or

updated version it gets something like

50% on SWE-bench and SWE-bench is an example

of a bunch of professional real world

software engineering tasks at the

beginning of the year I think the

state-of-the-art was three or 4% so in

10 months we've gone from 3% to 50% on

this task and I think in another year

we'll probably be at 90% I mean I don't

know but might might even be might even

be less than that uh we've seen similar

things in graduate level math physics

and biology from models like OpenAI o1

uh so uh if we if we just continue to

extrapolate this right in terms of skill

skill that we have I think if we

extrapolate the straight curve Within a

few years we will get to these models

being you know above the the highest

professional level in terms of humans

now will that curve continue you've

pointed to and I've pointed to a lot of

reasons why you know possible reasons

why that might not happen but if the if

the extrapolation curve continues that

is the trajectory we're on so Anthropic

has several competitors it'd be

interesting to get your sort of view of

it all OpenAI Google xAI Meta what

does it take to win in the broad sense

of win in the space yeah so I want to

separate out a couple things right so

you know Anthropic's mission is

to kind of try to make this all go well

right and and you know we have a theory

of change called race to the top right

race to the top is about trying to push

the other players to do the right thing

by setting an example it's not about

being the good guy it's about setting

things up so that all of us can be the

good guy I'll give a few examples of

this early in the history of Anthropic

one of our co-founders Chris Olah who I

believe you're you're interviewing soon

you know he's the co-founder of the

field of mechanistic interpretability

which is an attempt to understand what's

going on inside AI models uh so we had

him and one of our early teams focus on

this area of interpretability which we

think is good for making models safe and

transparent for three or four years that

had no commercial application whatsoever

it still doesn't today we're doing some

early betas with it and probably it will

eventually but uh you know this is a

very very long research bed in one in

which we've we've built in public and

shared our results publicly and and we

did this because you know we think it's

a way to make models safer an

interesting thing is that as we've done

this other companies have started doing

it as well in some cases because they've

been inspired by it in some cases

because they're worried that uh you know

if if other companies are doing this

that look more responsible they want to

look more responsible too no one wants

to look like the irresponsible

actor and and so they adopt this they

adopt this as well when folks come to

anthropic interpretability is often a

draw and I tell them the other places

you didn't go tell them why you came

here um and and then you see soon that

there that there's interpretability

teams else elsewhere as well and in a

way that takes away our competitive

Advantage because it's like oh they now

others are doing it as well but it's

good it's good for the broader system

and so we have to invent some new thing

that we're doing others aren't doing as

well and the hope is to basically bid up

bid up the importance of of of doing the

right thing and it's not it's not about

us in particular right it's not about

having one particular good guy other

companies can do this as well if they if

they if they join the race to do this

that's that's you know that's the best

news ever right um uh it's it's just

it's about kind of shaping the

incentives to point upward instead of

shaping the incentives to point to point

downward and we should say this example

the field of uh mechanistic

interpretability is just a a rigorous

non handwavy way of doing AI safety yes

or it's tending that way trying to I

mean I I think we're still early um in

terms of our ability to see things but

I've been surprised at how much we've

been able to look inside these systems

and understand what we see right unlike

with the scaling laws where it feels

like there's some you know law that's

driving these models to perform better

on on the inside the models aren't you

know there's no reason why they should

be designed for us to understand them

right they're designed to operate

they're designed to work just like the

human brain or human biochemistry

they're not designed for a human to open

up the hatch look inside and understand

them but we have found and you know you

can talk in much more detail about this

to Chris that when we open them up when

we do look inside them we we find things

that are surprisingly interesting and as

a side effect you also get to see the

beauty of these models you get to

explore the sort of uh the beautiful

nature of large neural networks through

mechanistic interpretability I'm amazed at how

clean it's been I I'm amazed at things

like induction heads I'm amazed at

things like uh you know that that we can

you know use sparse autoencoders to find

these directions within the networks uh

and that the directions correspond to

these very clear Concepts we

demonstrated this a bit with the Golden

Gate Bridge Claude so this was an

experiment where we found a direction

inside one of the the neural network

layers that corresponded to the Golden

Gate Bridge and we just turned that way

up
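
As a concrete picture of what "turning a feature way up" can mean in code, here is a minimal PyTorch sketch that adds a scaled feature direction to one layer's activations on every forward pass; the model, layer, and feature vector are stand-ins, and this is not Anthropic's actual Golden Gate Claude implementation.

```python
# Hypothetical sketch of activation steering: add a scaled copy of a feature direction
# (for example one found by a sparse autoencoder) to a layer's output on every forward pass.
# The layer choice, strength, and feature vector are illustrative assumptions.
import torch

def steer_with_feature(layer, feature_direction: torch.Tensor, strength: float = 8.0):
    """Register a forward hook that nudges the layer's output along feature_direction."""
    unit = feature_direction / feature_direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + strength * unit.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# Usage sketch (assumes a HuggingFace-style causal LM and a precomputed direction):
# handle = steer_with_feature(model.model.layers[20], golden_gate_direction, strength=8.0)
# ... generations now drift toward the boosted concept ...
# handle.remove()
```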

And so we released this model as a demo it was kind of half a joke uh for a

couple days uh but it was it was

illustrative of of the method we

developed and uh you could you could

take the Golden Gate you could take the

model you could ask it about anything

you know you know it would be like how

you could say how was your day and

anything you asked because this feature

was activated would connect to the

Golden Gate Bridge so it would say you

know I'm I'm I'm feeling relaxed and

expansive much like the the arches of

the Golden Gate Bridge or you know it

would masterfully change topic to the

Golden Gate Bridge and it integrated

there was also a sadness to it to to the

focus it had on the Golden Gate Bridge I

think people quickly fell in love with

it I think so people already miss it

because it was taken down I think after

a day somehow these interventions on the

model um where where where where you

kind of adjust Its Behavior somehow

emotionally made it seem more human than

any other version of the model strong

personality strong ID strong personality

it has these kind of like obsessive

interests you know we can all think of

someone who's like obsessed with

something so it does make it feel

somehow a bit more human let's talk

about the present let's talk about

Claude so this year A lot has happened

in March Claude 3 Opus Sonnet and Haiku were

released then Claude 3.5 Sonnet in July with

an updated version just now released and

then also Claude 3.5 Haiku was released

okay can you explain the difference

between Opus Sonnet and Haiku and how we

should think about the different

versions yeah so let's go back to March

when we first released uh these three

models so you know our thinking was you

different companies produce kind of

large and small models better and worse

models we felt that there was demand

both for a really powerful model um you

know and you that might be a little bit

slower that you'd have to pay more for

and also for fast cheap models that are

as smart as they can be for how fast and

cheap right whenever you want to do some

kind of like you know difficult analysis

like if I you know I want to write code

for instance or you know I want to I

want to brainstorm ideas or I want to do

creative writing I want the really

powerful model but then there's a lot of

practical applications in a business

sense where it's like I'm interacting

with a website I you know like I'm like

doing my taxes or I'm you know talking

to uh you know to like a legal adviser

and I want to analyze a contract or you

know we have plenty of companies that

are just like you know you know I want

to do autocomplete on my on my IDE or

something uh and and for all of those

things you want to act fast and you want

to use the model very broadly so we

wanted to serve that whole spectrum of

needs um so we ended up with this uh you

know this kind of poetry theme and so

what's a really short poem it's a

haiku and so Haiku is the small fast cheap

model that is you know was at the time

was released surprisingly surprisingly

uh intelligent for how fast and cheap it

was uh sonnet is a is a medium-sized

poem right a couple paragraphs so

Sonnet was the middle model it is smarter

but also a little bit slower a little

bit more expensive and and Opus like a

magnum opus is a large work uh Opus was

the the largest smartest model at the

time um so that that was the original

kind of thinking behind it um and our

our thinking then was well each new

generation of models should shift that

tradeoff curve uh so when we release

Sonnet 3.5 it has the same roughly the

same you know cost and speed as the

Sonnet 3 model uh but it increased

its intelligence to the point where it

was smarter than the original Opus 3

Model uh especially for code but but

also just in general and so now you know

we've shown results for Haiku 3.5 and I

believe Haiku 3.5 the smallest new model

is about as good as Opus 3 the largest

old model so basically the aim here is

to shift the curve and then at some

point there's going to be an opus 3.5 um

now every new generation of models has

its own thing they use new data their

personality changes in ways that we kind

of you know try to steer but are not

fully able to steer and and so uh

there's never quite that exact

equivalence the only thing you're

changing is intelligence um we always

try and improve other things and some

things change without us without us

knowing or measuring so it's it's very

much an inexact science in many ways the

manner and personality of these models

is more an art than it is a science so

what is sort of the reason for uh the

span of time between say Claude Opus 3

and 3.5 what is it what takes that time

if you can speak to yeah so there's

there's different there's different uh

processes um uh there's pre-training

which is you know just kind of the

normal language model training and that

takes a very long time um that uses you

know these days you know tens you know

tens of thousands sometimes many tens of

thousands of uh GPUs or TPUs or Trainium

or you know what we use different

platforms but you know accelerator chips

um often often training for months uh

there's then a kind of posttraining

phase where we do reinforcement learning

from Human feedback as well as other

kinds of reinforcement learning that

that phase is getting uh larger and

larger now and you know you know often

that's less of an exact science it often

takes effort to get it right um models

are then tested with some of our early

Partners to see how good they are and

they're then tested both internally and

externally for their safety particularly

for catastrophic and autonomy risks uh

so uh we do internal testing according

to our responsible scaling policy which

I you know could talk more about that in

detail and then we have an agreement

with the US and the UK AI safety

Institute as well as other third-party

testers in specific domains to test the

models for what are called CBRN risks

chemical biological radiological and

nuclear which are you know we don't

think that models pose these risks

seriously yet but but every new model we

want to evaluate to see if we're

starting to get close to some of these

these these more dangerous um uh these

more dangerous capabilities so those are

the phases and then uh you know then

then it just takes some time to get the

model working in terms of inference and

launching it in the API so there's just

just a lot of steps to uh to actually to

actually making a model work and of

course you know we're always trying to

make the processes as streamlined as

possible right we want our safety

testing to be rigorous but we want it to

be rigorous and to be you know to be

automatic to happen as fast as it can

without compromising on rigor same with

our pre-training process and our

posttraining process so you know it's

just like building anything else it's

just like building airplanes you want to

make them you know you want to make them

safe but you want to make the process

streamlined and I think the creative

tension between those is is you know is

an important thing and making the models

work yeah uh rumor on the street I

forget who was saying that uh Anthropic

has really good tooling so uh probably

a lot of the challenge here is on the

software engineering side is to build

the tooling to to have a like a

efficient low friction interaction with

the infrastructure you would be

surprised how much of the challenges of

uh you know building these models comes

down to you know software engineering

performance engineering you know you you

know from the outside you might think oh

man we had this Eureka breakthrough

right you know this movie with the

science we discovered it we figured it

out but but but I think I think all

things even even even you know

incredible discoveries like they they

they they they almost always come down

to the details um and and often super

super boring details I can't speak to

whether we have better tooling than than

other companies I mean you know I

haven't been at those other companies at

least at least not recently um but it's

certainly something we give a lot of

attention to I don't know if you can say

but from Claude 3 to Claude 3.5 is

there any extra pre-training going on or

is it mostly focused on the

post-training there's been leaps in

performance yeah I think I think at any

given stage we're focused on improving

everything at once um just just

naturally like there are different teams

each team makes progress in a particular

area in in in making a particular you

know their particular segment of the

relay race better and it's just natural

that when we make a new model we put we

put all of these things in at once so

the data you have like the preference

data you get from RLHF is that applicable

is there ways to apply it to newer

models as it get trained up yeah

preference data from old models

sometimes gets used for new models

although of course uh it it performs

somewhat better when it's you know

trained on it's trained on the new

models note that we have this you know

constitutional AI method such that we

don't only use preference data we kind

of there's also a post-training

process where we train the model against

itself and there's you know new types of

post training the model against itself

that are used every day so it's not just

RLHF it's a bunch of other methods as well

um post training I think you know it's

becoming more and more sophisticated
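
To make the "train the model against itself" idea a bit more tangible, below is a minimal sketch of a constitutional-AI-style critique-and-revise loop; the generate function and the principles are illustrative assumptions, not Anthropic's actual post-training pipeline.

```python
# Hypothetical sketch of a critique-and-revise step: the model checks its own draft against
# written principles and rewrites it, and the revised answers become training data. The
# principles and the generate() interface are assumptions for illustration.
PRINCIPLES = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that would assist with clearly dangerous activities.",
]

def constitutional_revision(generate, prompt):
    draft = generate(prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {draft}\n"
            "Critique the response with respect to the principle."
        )
        draft = generate(
            f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # (prompt, draft) pairs can feed supervised fine-tuning, and the model's own rankings of
    # candidate drafts can drive RL from AI feedback, alongside ordinary RLHF.
    return prompt, draft
```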

Well what explains the big leap in

performance for the new Sonnet 3.5 I mean

at least in the programming side and

maybe this is a good place to talk about

benchmarks what does it mean to get

better just the number went up but you

know I I I program but I also love

programming and I um Claude 3.5 through

Cursor is what I use uh to assist me in

programming and there was at least

experientially anecdotally it's gotten

smarter at programming so what like what

what does it take to get it uh to get it

smarter we observe that as well by the

way there were a couple uh very strong

Engineers here at anthropic um who all

previous code models both produced by us

and produced by all the other companies

hadn't really been useful to to hadn't

really been useful to them you know they

said you know maybe maybe this is useful

to beginner it's not useful to me but

Sonnet 3.5 the original one for the first

time they said oh my God this helped me

with something that you know that it

would have taken me hours to do this is

the first model that has actually saved

me time so again the water line is

rising and and then I think you know the

new Sonnet has been even better

in terms of what it what it takes I mean

I'll just say it's been across the board

it's in the pre-training it's in the

posttraining it's in various evaluations

that we do we've observed this as well

and if we go into the details of the

Benchmark so SWE-bench is basically you

know since

you're a programmer you know you'll be

familiar with like pull requests and you

know uh just pull requests are like

you know the like a sort of a sort of

atomic unit of work you know you could

say I'm you know I'm implementing one

I'm implementing one thing um uh and and

so sbench actually gives you kind of a

real world situation where the codebase

is in a current state and I'm trying to

implement something that's you know

that's described in described in

language we have internal benchmarks

where we where we measure the same thing

and you say just give the model free

rein to like you know do anything run

anything edit anything um how

how well is it able to complete these

tasks and it's that Benchmark that's

gone from it can do it 3% of the time to

it can do it about 50% of the time
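
Since the benchmark described here amounts to letting the model edit a repository freely and then checking whether the work is done, below is a hypothetical sketch of such an evaluation loop; the model interface and helpers are assumptions, not the real SWE-bench harness or Anthropic's internal benchmark.

```python
# Hypothetical sketch of a SWE-bench-style loop: the model gets an issue description and a
# repo in a known state, may edit files or run commands, and is scored on whether the
# project's tests pass afterward. The model's action interface is assumed for illustration.
import subprocess

def tests_pass(repo_dir):
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir, capture_output=True, text=True)
    return result.returncode == 0

def resolution_rate(model, tasks, max_steps=25):
    solved = 0
    for task in tasks:                                   # task: {"repo_dir": ..., "issue": ...}
        for _ in range(max_steps):
            action = model.next_action(task["issue"], task["repo_dir"])
            if action.kind == "edit":                    # overwrite a file with new contents
                with open(action.path, "w") as f:
                    f.write(action.new_contents)
            elif action.kind == "command":               # run an arbitrary shell command
                subprocess.run(action.argv, cwd=task["repo_dir"])
            elif action.kind == "done":
                break
        solved += tests_pass(task["repo_dir"])
    return solved / len(tasks)                           # e.g. roughly 0.50 for the newest Sonnet
```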

So I actually do believe that you

can game benchmarks but I think if we

get to 100% on that Benchmark in a way

that isn't kind of like overtrained or

gamed for that particular benchmark it

probably represents a real and serious

increase in kind of

in kind of programming programming

ability and and I would suspect that if

we can get to you know 90 90 95% that

that that that you know it will it will

represent ability to autonomously do a

significant fraction of software

engineering

tasks well ridiculous timeline question

uh when is Claude Opus 3.5 coming up uh

not giving you an exact date uh but you

know there there uh you know as far as

we know the plan is still to have a

Claude 3.5 opus are we gonna get it

before GTA 6 or no like Duke Nukem

Forever was that game that there was

some game that was delayed 15 years was

that Duke Nukem Forever yeah and I think

GTA is now just releasing trailers it

you know it's only been three months

since we released the first Sonnet yeah

it's the incredible pace of releases it

just it just tells you about the pace

the expectations for when things are

going to come out so uh what about

4.0 so how do you think about sort of as

these models get bigger and bigger about

versioning and also just versioning in

general why Sonnet 3.5 updated with the

date why not Sonnet

3.6 naming is actually an

interesting challenge here right because

I think a year ago most of the model was

pre-training and so you could start from

the beginning and just say okay we're

going to have models of different sizes

we're going to train them all together

and you know we'll have a a family of

naming schemes and then we'll put some

new magic into them and then you know

we'll have the next the next Generation

Um the trouble starts are already when

some of them take a lot longer than

others to train right that already

messes up your time time a little bit

but as you make big improvements in as

you make big improvements in

pre-training uh then you suddenly notice

oh I can make better pre-train model and

that doesn't take very long to do and

but you know clearly it has the same you

know size and shape of previous models

uh uh so I think those two together as

well as the timing timing issues any

kind of scheme you come up with uh you

know the reality tends to kind of

frustrate that scheme right T tends to

kind of break out of the break out of

the scheme it's not like software where

you can say oh this is like you know 3.7

this is 3.8 no you have models with

different different tradeoffs you can

change some things in your models you

can train you can change other things

some are faster and slower at inference

some have to be more expensive some have

to be less expensive and so I think all

the companies have struggled with this

um I think we did very you know I think

think we were in a good good position in

terms of naming when we had Haiku Sonnet

and we're trying to maintain it but it's

not it's not it's not perfect um so

we'll we'll we'll try and get back to

the Simplicity but it it um uh just the

the the nature of the field I feel like

no one's figured out naming it's somehow

a different Paradigm from like normal

software and and and so we we just none

of the companies have been perfect at it

um it's something we struggle with

surprisingly much relative to you know

how relative to how trivial it is to you

know for the the the the grand science

of training the models so from the user

side the user experience of the updated

Sonnet 3.5 is just different than the

previous uh June 2024 Sonnet 3.5 it would

be nice to come up with some kind of

labeling that embodies that because

people talk about Sonnet 3.5 but now there's

a different one and so how do you refer

to the previous one and the new one and

it it uh when there's a distinct

Improvement it just makes conversation

about it uh just challenging yeah yeah I

I definitely think this question of

there are lots of properties of the

models that are not reflected in the

benchmarks um I I think I think that's

that's definitely the case and everyone

agrees and not all of them are

capabilities some of them are you know

models can be polite or brusque they can

be uh you know uh very reactive or they

can ask you questions um they can have

what what feels like a warm personality

or a cold personality they can be boring

or they can be very distinctive like

Golden Gate Claude was um and we have a

whole you know we have a whole team kind

of focused on I think we call it Claude

character uh Amanda leads that team and

we'll we'll talk to you about that but

it's still a very inexact science um and

and often we find that models have

properties that we're not aware of the

the fact of the matter is that you can

you know talk to a model 10,000 times

and there are some behaviors you might

not see uh just like just like with a

human right I can know someone for a few

months and you know not know that they

have a certain skill or not know there's

a certain side to them and so I think I

think we just have to get used to this

idea and we're always looking for better

ways of testing our models to to

demonstrate these capabilities and and

and also to decide which are which are

the which are the personality properties

we want models to have have and which we

don't want to have that itself the

normative question is also super

interesting I got to ask you a question

from Reddit from Reddit oh

boy you know there there's just this

fascinating to me at least it's a

psychological social

phenomenon where people report that

Claude has gotten Dumber for them over

time and so uh the question is does the

user complaint about the dumbing down of

Claude 3.5 Sonnet hold any water so are

these anecdotal reports a kind of social

phenomena or did Claude is there any

cases where Claude would get Dumber so

uh this actually doesn't apply this this

isn't just about Claude I I believe this

I believe I've seen these complaints for

every Foundation model produced by a

major company um people said this about

GPT-4 they said it about GPT-4 Turbo um so

so so a couple things um one the actual

weights of the model right the actual

brain of the model that does not change

unless we introduce a new model um there

there just a number of reasons why it

would not make sense practically to be

randomly substituting in substituting in

new versions of the model it's difficult

from an inference perspective and it's

actually hard to control all the

consequences of changing the weights of the

model let's say you wanted to fine-tune

the model to be like I don't know to

like to say certainly less which you

know an old version of Sonnet used to do

um you actually end up changing a 100

things as well so we have a whole

process for it and we have a whole

process for modifying the model we do a

bunch of testing on it we do a bunch of

um like we do a bunch of user testing

and early customers so it we both have

never changed the weights of the model

without without telling anyone and it it

it wouldn't certainly in the current

setup it would not make sense to do that

now there are a couple things that we do

occasionally do um one is sometimes we

run AB tests um but those are typically

very close to when a model is being is

being uh released and for a very small

fraction of time um so uh you know like

the you know the the day before the new

Sonnet 3.5 I agree we should have

should have had a better name it's

clunky to refer to it um there were some

comments from people that like it's got

It's got it's gotten a lot better and

that's because you know a fraction were

exposed to to an AB test for for those

one or for those one or two days um the

other is that occasionally the system

prompt will change um on the system

prompt can have some effects although

it's un it it it's unlikely to dumb down

models it's unlikely to make them Dumber

um and and and and we've seen that while

these two things which I'm listing to be

very complete um happen relatively

happen quite infrequently um the

complaints about to for us and for other

model companies about the model changed

the model isn't good at this the model

got more censored the model was dumb

down those complaints are constant and

so I don't want to say like people are

imagining it or anything but like the

models are for the most part not

changing um if I were to offer a theory

um I I think it actually relates to one

of the things I said before which is

that models have many are very complex

and have many aspects to them and so

often you know if I if I if if I ask a

model a question you know if I'm like if

I'm like do task X versus can you do

task X the model might respond in

different ways uh and and so there are

all kinds of subtle things that you can

change about the way you interact with

the model that can give you very

different results um to be clear this

this itself is like a failing by by us

and by the other model providers that

that the models are are just just often

sensitive to like small small changes in

wording it's yet another way in which

the science of how these models work is

very poorly developed uh and and so you

know if I go to sleep one night and I

was like talking to the model in a

certain way and I like slightly changed

the phrasing of how I talk to the model

you know I could I could get different

results so that's that's one possible

way the other thing is man it's just

hard to quantify this stuff uh it's hard

to quantify this stuff I think people

are very excited by new models when they

come out and then as time goes on they

they become very aware of the they

become very aware of the limitations so

that may be another effect but that's

that's all a very long-winded way of

saying for the most part with some

fairly narrow exceptions the models are

not changing I think there is a

psychological effect you just start

getting used to it the baseline rises like

when people have first gotten Wi-Fi on

airplanes it's like amazing magic and

then now like I can't get this thing to

work this is such a piece of crap

exactly so it's easy to have the

conspiracy theory of they're making

Wi-Fi slower and slower this is probably

something I'll talk to Amanda much more

about but U another Reddit question uh

when will Claude stop trying to be my uh

puritanical grandmother imposing its moral

worldview on me as a paying customer and

also what is the ideology behind

making Claude overly apologetic so this

kind of reports about The Experience a

different angle on the frustration it

has to do with the character yeah so a

couple points on this first one is um

like things that people say on Reddit

and Twitter or X or whatever it is um

there's actually a huge distribution

shift between like the stuff that people

complain loudly about on social media

and what actually kind of like you know

statistically users care about and that

drives people to use the models like

people are frustrated with you know

things like you know the model not

writing out all the code or the model uh

you know just just not being as good at

code as it could be even though it's the

best model in the world on code um I I

think the majority of thing of things

are about that um uh but uh certainly a

a a kind of vocal minority are uh you

know kind kind of kind of rais these

concerns right are frustrated by the

model refusing things that it shouldn't

refuse or like apologizing too much or

just just having these kind of like

annoying verbal tics um the second

caveat and I just want to say this like

super clearly because I think it's like

some people don't know it others like

kind of know it but forget it like it is

very difficult to control across the

board how the models behave you cannot

just reach in there and say oh I want

the model to like apologize less like

you can do that you can include training

data that says like oh the models should

like apologize less but then in some

other situation they end up being like

super rude or like overconfident in a

way that's like misleading people so

they're they're all these tradeoffs um

uh for example another thing is if there

was a period during which models ours

and I think others as well were too

verbose right they would like repeat

themselves they would say too much um

you can cut down on the verbosity by

penalizing the models for for just

talking for too long what happens when

you do that if you do it in a crude way

is when the models are coding sometimes

they'll say the rest of the code goes here right

because they've learned that that's a

way to economize and that they see it

and then and then so that leads the

model to be so-called lazy in coding

where they where they where they're just

like ah you can finish the rest of it

it's not it's not because we want to you

know save on compute or because you know

the models are lazy and you know during

winter break or any of the other kind of

conspiracy theories that have that have

that have come up it's actually it's

just very hard to control the behavior

of the model to steer the behavior of

the model in all circumstances at once

you can kind of there's this whack-a-mole

aspect where you push on one thing and

like you know these these these you know

these other things start to move as well

that you may not even notice or measure

and so one of the reasons that I that I

care so much about uh you know kind of

grand alignment of these AI systems in

the future is actually these systems are

actually quite unpredictable they're

actually quite hard to steer and control

um and this version we're seeing today

of you make one thing better it makes

another thing worse uh I think that's

that's like a present day analog of

future control problems in AI systems

that we can start to study today right I

think I think that that that difficulty

in in steering the behavior and in

making sure that if we push an AI system

in One Direction it doesn't push it in

another Direction in some in some other

ways that we didn't want uh I think

that's that's kind of an that's kind of

an early sign of things to come and if

we can do a good job of solving this

problem right of like you ask the model

to like you know to like make and

distribute smallpox and it says no but

it's willing to like help you in your

graduate level virology class like how

do we get both of those things at once

it's hard it's very easy to go to one

side or the other and it's a

multi-dimensional problem and so uh I

you know I think these questions of like

shaping the models personality I think

they're very hard I think we haven't

done perfectly on them I think we've

actually done the best of all the AI

companies but still so far from perfect

uh and I think if we can get this right

if we can control the the you know

control the false positives and false

negatives in this this very kind of

controlled present day environment will

be much better at doing it for the

future when our worry is you know will

the models be super autonomous will they

be able to you know make very dangerous

things will they be able to autonomously

you know build whole companies and are

those companies aligned so so I I I

think of this this present task as both

a vaccine but also good practice for the

future what's the current best way of

gathering sort of user feedback like uh

not anecdotal data but just large scale

data about pain points or the opposite

of pain points positive things so on is

it internal testing is it yeah A

specific group testing a testing what

what what works so so so typically um

we'll have internal model bashings where

all of anthropic anthropic is almost a

thousand people um you know people just

just try and break the model they try

and interact with it various ways um uh

we have a suite of evals uh for you know

oh is the model refusing in ways that

that it couldn't I think we even had a

certainly eval because you know our

model again at one point had this

problem where like it had this annoying

tick where it would like respond to a

wide range of questions by saying

certainly I can help you with that

certainly I would be happy to do that

certainly this is correct um uh and so

we had a like certainly eval which is

like how how often does the model say

certainly
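
An eval like this can be as simple as the sketch below, which measures how often a particular verbal tic shows up across a batch of responses; the sample responses and the exact wording tracked are made up for illustration.

```python
# Hypothetical sketch of a "certainly eval": count how often responses lean on one phrase.
import re

def tic_rate(responses, tic="certainly"):
    pattern = re.compile(rf"\b{re.escape(tic)}\b", re.IGNORECASE)
    return sum(bool(pattern.search(r)) for r in responses) / len(responses)

responses = [
    "Certainly! I can help you with that.",
    "Here is the code you asked for.",
    "Certainly, this is correct.",
]
print(f"'certainly' appears in {tic_rate(responses):.0%} of responses")  # 67%
```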

But look this is just a whack-a-mole like what if it

switches from certainly to definitely

like uh uh so you know every time we add
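A minimal sketch of what such a 'certainly'-style tic eval could look like, assuming a fixed prompt set and a simple check over the model's replies; the phrase list and names here are hypothetical, not Anthropic's actual eval harness:

```python
# Hypothetical sketch of a "certainly"-style tic eval: given a batch of model
# replies to a fixed prompt set, measure how often a reply opens with one of
# the tracked filler phrases. Phrase list and function names are illustrative.

TIC_PHRASES = ("certainly", "definitely", "of course")

def tic_rate(replies: list[str], phrases: tuple[str, ...] = TIC_PHRASES) -> float:
    """Fraction of replies that begin with one of the tracked filler phrases."""
    hits = sum(1 for r in replies if r.strip().lower().startswith(phrases))
    return hits / max(len(replies), 1)

if __name__ == "__main__":
    sample = [
        "Certainly! I can help you with that.",
        "Here is the function you asked for.",
        "Definitely, this is correct.",
    ]
    print(f"tic rate: {tic_rate(sample):.2f}")  # 0.67 on this toy sample
```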

So every time we add a new eval, we're always evaluating for all the old things, so we have

hundreds of these evaluations but we

find that there's no substitute for

human interacting with it and so it's

very much like the ordinary product

development process we have like

hundreds of people within anthropic bash

the model then we do uh you know then we

do external AB tests sometimes we'll run

tests with contractors we pay

contractors to interact with the model

um so you put all of these things

together and it's still not perfect you

still see behaviors that you don't quite

want to see right you know you see you

still see the model like refusing things

that it just doesn't make sense to

refuse um but I I I think trying to

trying to solve this challenge right

trying to stop the model from doing you

know genuinely bad things that you know

no one everyone agrees it shouldn't do

right you know everyone everyone you

know everyone agrees that you know the

model shouldn't talk about you know I I

don't know child abuse material right

like everyone agrees the model shouldn't

do that uh but but at the same time that

it doesn't refuse in these dumb and

stupid ways uh I think I think draw

drawing that line as finely as possible

approaching perfectly is still is still

a challenge and we're getting better at

it every day but there's there's a lot

to be solved and again I would point to

that as as an indicator of a challenge

ahead in terms of steering much more

powerful models do you think Claude 4.0

is ever coming out I don't want to

commit to any naming scheme because if I

say if I say here we're gonna have

Claude 4 next year and then and then you

know then we decide that like you know

we should start over because there's a

new type of model — like, I don't

want to I don't want to commit to it I

would expect in a normal course of

business that Claude four would come

after Claude 3.5 but but you know you

you you never know in this wacky field

right but the sort of this idea of

scaling is continuing scal scaling is

continuing there there will definitely

be more powerful models coming from us

than the models that exist today, that is

that is certain or if there if there

aren't we've we've deeply failed as a

company okay can you explain the

responsible scaling policy and the AI

safety level standards ASL levels as

much as I'm excited about the benefits

of these models and you know we'll talk

about that if we talk about Machines of

Loving Grace um I'm I'm worried about

the risk and I continue to be worried

about the risks uh no one should think

that, you know, Machines of Loving Grace

was me me saying uh you know I'm no

longer worried about the risks of these

models I think they're two sides of the

same coin the the uh Power of the models

and their ability to solve all these

problems in you know biology

Neuroscience Economic Development

government governance and peace large

parts of the economy those those come

with risks as well right with great

power comes great responsibility right

that's the the two are the two are

paired uh things that are powerful can

do good things and they can do bad

things um I think of those risks as as

being in you know several different

different categories perhaps the two

biggest risks that I think about and

that's not to say that there aren't

risks today that are that are important

but when I think of the really the the

you know the things that would happen on

the grandest scale um one is what I call

catastrophic misuse these are misuse of

the models in domains like cyber bio

radiological nuclear right things that

could you know that could harm or even

kill thousands even millions of people

if they really really go wrong um like

these are the you know number one

priority to prevent and and here I would

just make a simple observation which is

that Mo the models you know if if I look

today at people who have done really bad

things in the world um uh I think

actually Humanity has been protected by

the fact that the overlap between really

smart well-educated people and people

who want to do really horrific things

has generally been small like you know

let's say let's say I'm someone who you

know uh you know I have a PhD in this

field I have a well-paying job um

there's so much to lose why do I want to

like you know even even assuming I'm

completely evil which which most people

are not um why why you know why would

such a person risk their life, risk their legacy, their reputation, to do

something like you know truly truly evil

if we had a lot more people like that

the world would be a much more dangerous

place. And so my worry is that, by being a much more intelligent agent, AI

could break that correlation and so I I

I I I do have serious worries about that

I believe we can prevent those worries

uh but you know I I think as a

Counterpoint to Machines of Loving Grace

I want to say that this is I there's

still serious risks and and the second

range of risks would be the autonomy

risks which is the idea that models

might on their own particularly as we

give them more agency than they've had

in the past uh particularly as we give

them supervision over wider tasks like

you know writing whole code bases or

someday even you know effectively

operating entire entire companies

they're on a long enough leash are they

are they doing what we really want them

to do it's very difficult to even

understand in detail what they're doing

let alone let alone control it and like

I said this these early signs that it's

it's hard to perfectly draw the boundary

between things the model should do and

things the model shouldn't do that that

you know if if you go to one side you

get things that are annoying and useless

and you go to the other side you get

other behaviors if you fix one thing it

creates other problems we're getting

better and better at solving this I

don't think this is an unsolvable

problem I think this is a you know this

is a science like like the safety of

airplanes or the safety of cars or the

safety of drugs I you know I I don't

think there's any big thing we're

missing I just think we need to get

better at controlling these models and

so these are these are the two risks I'm

worried about and our responsible

scaling plan which I'll recognize is a

very long-winded answer to your question

I love it I love it our responsible

scaling plan is designed to address

these two types of risks and so every

time we develop a new model we basically

test it for its ability to do both of

these bad things so if I were to back up

a little bit um I I think we have a I

think we have an interesting dilemma

with AI systems where they're not yet

powerful enough to present these

catastrophes. I don't know that they'll ever present these catastrophes — it's possible they won't —

but the the case for worry the case for

risk is strong enough that we should we

should act now and and they're they're

getting better very very fast right I

you know I testified in the Senate that

you know we might have serious bio risks

within two to three years that was about

a year ago; things have proceeded apace. So we have this thing where

it's like it's it's it's surprisingly

hard to to address these risks because

they're not here today they don't exist

they're like ghosts but they're coming

at us so fast because the models are

improving so fast so so how do you deal

with something that's not here today

doesn't exist but is is coming at us

very fast uh so the solution we came up

with for that in in collaboration with

uh you know people like uh the

organization METR and Paul Christiano

is: okay, what you need for that — you need tests to tell you when

the risk is getting close you need an

early warning system and and so every

time we have uh a new model we test it

for its capability to do these CBRN tasks

as well as testing it for you know how

capable it is of doing tasks

autonomously on its own and uh in the

latest version of our RSP which we

released in the last in the last month

or two uh the way we test autonomy risks

is the AI model's ability to do aspects of AI research itself — which, when the AI models can do AI research, they become kind of truly autonomous. And that, you

know that threshold is important for a

bunch of other ways and and so what do

we then do with these tasks the RSP

basically develops what we've called an

if then structure which is if the models

pass a certain capability then we impose

a certain set of Safety and Security

requirements on them.
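As a rough sketch of that if-then shape — with hypothetical eval names, thresholds, and measures, not the actual contents of the RSP — the structure is essentially a table of triggers checked against the latest eval scores:

```python
# Hypothetical sketch of an "if-then" capability trigger: if an eval score
# crosses a threshold, a set of safety/security requirements switches on.
# Eval names, thresholds, and measures are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Trigger:
    eval_name: str                  # the "if": which capability eval to check
    threshold: float                # score at which the trigger fires
    required_measures: list[str]    # the "then": measures that must be in place

TRIGGERS = [
    Trigger("cbrn_uplift_eval", 0.5,
            ["hardened model-weight security", "enhanced deployment filters"]),
    Trigger("autonomous_ai_research_eval", 0.5,
            ["interpretability-based verification before deployment"]),
]

def required_measures(scores: dict[str, float]) -> list[str]:
    """Collect every measure whose trigger condition is met by the latest scores."""
    measures = []
    for trigger in TRIGGERS:
        if scores.get(trigger.eval_name, 0.0) >= trigger.threshold:
            measures.extend(trigger.required_measures)
    return measures

print(required_measures({"cbrn_uplift_eval": 0.62}))  # first trigger fires
```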

So today's models are what's called ASL-2. ASL-1 is for systems that manifestly don't pose any risk of autonomy or misuse — so, for example, a chess-playing bot like Deep Blue would be ASL-1. It's just manifestly the case that you can't use Deep Blue for

anything other than chess it was just

designed for chess no one's going to use

it to like you know to conduct a

masterful Cyber attack or to you know

run wild and take over the world asl2 is

today's AI systems where we've measured

them and we think these systems are

simply not smart enough to uh to you

know autonomously self-replicate or

conduct a bunch of tasks uh and also not

smart enough to provide meaningful

information about cbrn risks and how to

build cbrn weapons above and beyond what

can be known from looking at Google uh

in fact sometimes they do provide

information but but not above and beyond

a search engine but not in a way that

can be stitched together um not not in a

way that kind of end to end is dangerous

enough so

asl3 is going to be the point at which

uh the models are helpful enough to

enhance the capabilities of non-state

actors, right? State actors can already do — unfortunately, to a high level of proficiency — a lot of these very dangerous and destructive things. The

difference is that non-state non-state

actors are not capable of it and so when

we get to asl3 we'll take special

security precautions designed to be be

sufficient to prevent theft of the model

by non-state actors and misuse of the

model as it's deployed uh will have to

have enhanced filters targeted at these

particular areas cyber bio nuclear cyber

bio nuclear and model autonomy Which is

less a misuse risk and more a risk of

the model doing bad things itself. ASL-4 is getting to the point where these models could enhance the capability of a knowledgeable state actor and/or become the, you know, the main

source of such a risk like if you wanted

to engage in such a risk the main way

you would do it is through a model and

then I think asl4 on the autonomy side

it's it's some some some amount of

acceleration in AI research capabilities

with an AI model. And then ASL-5 is where we would get to the models that are, you know, kind of truly capable — they could exceed humanity in their ability to do any of these tasks. And so the point of the if-then structure commitment is basically to

say

look I don't know I've been I've been

working with these models for many years

and I've been worried about risk for

many years it's actually kind of

dangerous to cry wolf it's actually kind

of dangerous to say this you know this

this model is this model is risky and

you know people look at it and they say

this is manifestly not dangerous again

it's the delicacy of — the risk isn't here today, but it's coming

at us fast how do you deal with that

it's it's really vexing to a risk

planner to deal with it and so this if

then structure basically says look we

don't want to antagonize a bunch of

people we don't want to harm our own you

know our our kind of own ability to have

a place in the conversation by imposing

these

these very onerous burdens on models that are not dangerous today. So the if-then, the trigger commitment, is basically a way to deal with this: it says you clamp down hard when you can show that

the model is dangerous and of course

what has to come with that is you know

enough of a buffer threshold that, you know, you're not at high risk of kind of missing the danger. It's

not a perfect framework we've had to

change it every every uh you know we

came out with a new one just a few weeks

ago and probably probably going forward

we might release new ones multiple times

a year because it's it's hard to get

these policies right like technically

organizationally from a research

perspective but that is the proposal if

then commitments and triggers in order

to minimize burdens and false alarms now

but really react appropriately when the

dangers are here what do you think the

timeline for asl3 is where several of

the triggers are fired and what do you

think the timeline is for asl4 yeah so

that is hotly debated within the company

um uh we are working actively to prepare

ASL-3 security measures as well as ASL-3 deployment measures.

I'm not going to go into detail but

we've made we've made a lot of progress

on both and you know we're we're

prepared to be I think ready quite soon

I would not be surprised at all if we hit ASL-3 next year. There was some

concern that we we might even hit it uh

uh this year that's still that's still

possible that could still happen it's

like very hard to say but like I would

be very very surprised if it was like

2030 uh I think it's much sooner than

that. So there are protocols for detecting it, the if-then, and then there are protocols for how to respond to it? Yes. How difficult is the second, the latter? Yeah, I think for ASL-3 it's

primarily about security um and and

about you know filters on the model

relating to a very narrow set of areas

when we deploy the model because at asl3

the model isn't autonomous yet um uh and

and so you don't have to worry about you

know kind of the model itself behaving

in a bad way even when it's deployed

internally so I think the asl3 measures

are are I won't say straightforward

they're they're they're they're rigorous

but they're easier to reason about I

think once we get to

asl4 um we start to have worries about

the models being smart enough that they

might sandbag tests they might not tell

the truth about tests um we had some

results come out about, like, sleeper agents, and there was a more recent paper about, you know, can the models mislead attempts to — you know, sandbag their own abilities, right, present themselves as being

less capable than they are and so I

think with asl4 there's going to be an

important component of using other

things than just interacting with the

models for example interpretability or

hidden chains of thought uh where you

have to look inside the model and verify

via some other mechanism that that is

not you know is not as easily corrupted

as what the model says

uh that that you know that that that the

model indeed has some property uh so

we're still working on asl4 one of the

properties of the RSP is that we we

don't specify ASL-4 until we've hit ASL-3, and I think that's proven to be a

wise decision because even with asl3 it

again it's hard to know this stuff in

detail and and it it we want to take as

much time as we can possibly take to get

these things right so for asl3 the bad

actor will be the humans humans yes and

so there it's a little bit more uh for

asl4 it's both I think it's both and so

deception and that's where mechanistic

interpretability comes into play and

hopefully the techniques used for that

are not made accessible to the model

yeah I mean of course you can hook up

the mechanistic interpretability to the model itself, but then you've kind of lost it as a reliable indicator of the model state. There are a bunch of

exotic ways you can think of that it

might also not be reliable like if the

you know model gets smart enough that it

can like you know jump computers and

like read the code where you're like

looking at its internal State we've

thought about some of those I think

they're exotic enough there are ways to

render them unlikely but yeah generally

you want to you want to preserve

mechanistic interpretability as a kind

of verification set or test set that's

separate from the training process of

the model see I think uh as these models

become better and better conversation

and become smarter social engineering

becomes a threat too cuz they oh yeah

that can start being very convincing to

the engineers inside companies oh yeah

yeah it's actually like you know we've

we've seen lots of examples of

demagoguery in our life from humans and

and you know there's a concern that

models could do that could do that as

well one of the ways that cloud has been

getting more and more powerful is it's

now able to do some agentic stuff um

computer use uh there's also an analysis

within the sandbox of Claude.ai itself but

let's talk about computer use that's

seems to me super exciting that you can

just give Claude a task and it uh takes

a bunch of actions figures it out and

has access to your computer through

screenshots so can you explain how that

works uh and where that's headed yeah

it's actually relatively simple so

Claude has has had for a long time since

since Claude 3 back in March the ability

to analyze images and respond to them

with text the the only new thing we

added is that those images can be screenshots of a computer, and in response we

train the model to give a location on

the screen where you can click and/or

buttons on the keyboard you can press in

order to take action and it turns out

that with actually not all that much

additional training the models can get

quite good at that task it's a good

example of generalization um you know

people sometimes say if you get to low

earth orbit you're like halfway to

anywhere right because of how much it

takes to escape the gravity well if you

have a strong pre-trained model I feel

like you're halfway to anywhere uh in

ter in terms of in terms of the

intelligence space uh uh uh and and and

so actually it didn't it didn't take all

that much to get to get Claude to do

this and you can just set that in a loop

give the model a screenshot tell it what

to click on give it the next screenshot

tell it what to click on — and that turns into a full kind of, almost, 3D video interaction of the model.
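A minimal sketch of that loop, with toy stub functions standing in for the real screenshot capture, model call, and input-automation layer (all names here are hypothetical, not the actual computer use API):

```python
# Hypothetical sketch of the screenshot-in, action-out loop: capture the screen,
# ask the model what to click or type, perform it, and repeat until it says done.
# The three helpers below are toy stubs, not a real integration.

def capture_screenshot() -> bytes:
    return b"<png bytes>"  # stub: a real agent would grab the actual screen here

def ask_model(task: str, screenshot: bytes) -> dict:
    # stub: a real agent would send the screenshot to the model and parse its reply,
    # e.g. {"type": "click", "x": 312, "y": 80} or {"type": "key", "text": "hello"}
    return {"type": "done"}

def perform(action: dict) -> None:
    print("performing", action)  # stub: a real agent would click or type here

def run_computer_use(task: str, max_steps: int = 20) -> None:
    """Give the model a screenshot, apply the action it returns, and loop."""
    for _ in range(max_steps):
        action = ask_model(task, capture_screenshot())
        if action.get("type") == "done":
            break
        perform(action)

run_computer_use("fill out the spreadsheet")
```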

And it's able to do all of these tasks, right? You know, we showed these demos where

it's able to like fill out spreadsheets

it's able to kind of like interact with

a website it's able to you know um you

know it's able to open all kinds of you

know programs different operating

systems Windows Linux Mac uh uh so uh

you know I think all of that is very

exciting I I will say while in theory

there's nothing you could do there that

you couldn't have done through just

giving the model the API to drive the

computer screen uh this really lowers

the barrier, and you know, there's a lot of folks who either, you know, aren't in a position to interact with those APIs, or it takes them a long time to do it. The

screen is just a universal interface

that's a lot easier to interact with and

so I expect over time this is going to

lower a bunch of barriers now honestly

the current model leaves a lot still to be desired, and we

were we were honest about that in the

blog right it makes mistakes it

misclicks and we we you know we were

careful to warn people hey this thing

isn't you can't just leave this thing to

you know run on your computer for

minutes and minutes um you got to give

this thing boundaries and guard rails

and I think that's one of the reasons we

released it first in an API form rather

than kind of, you know, just hand it to the consumer and give it control of their computer. But, you know, I definitely feel that it's

important to get these capabilities out

there as models get more powerful we're

going to have to Grapple with you know

how do we use these capabilities safely

how do we prevent them from being abused

uh and and you know I think I think

releasing the model while the capabilities are still limited is very helpful in terms of doing that. You know, I think since

it's been released a number of customers

I think Replit was maybe one of the quickest to deploy things — have, you know, made

use of it in various ways people have

hooked up demos for you know Windows

desktops Macs

uh uh you know Linux Linux machines uh

so yeah it's been it's been it's been

very exciting I think as with as with

anything else you know it it it comes

with new exciting abilities and then

then then you know then then with those

new exciting abilities we have to think

about how to how to you know make the

model you know safe reliable do what

humans want them to do I mean it's the

same it's the same story for everything

right same thing it's that same tension

but but the possibility of use cases

here is just the the range is incredible

so uh how much to make it work really

well in the future how much do you have

to specially, kind of, go beyond what the pre-trained model is doing — do more post-training, RLHF or supervised fine-tuning or synthetic data, just for the agentic stuff? Yeah, I think, speaking at

a high level It's Our intention to keep

investing a lot in you know making

making the model better uh like I think

I think uh you know we look at look at

some of the you know some of the

benchmarks where previous models were

like, oh, it could do it 6% of the time, and now our model does it 14 or 22% of the

time and yeah we want to get up to you

know, the human-level reliability of 80, 90%, just like anywhere else, right? We're on the same curve that we were on with SWE-bench, where I think I would guess a

year from now the models can do this

very very reliably but you got to start

somewhere so you think it's possible to

get to the the human level 90% uh

basically doing the same thing you're

doing now or is it has to be special for

computer use? I mean, it depends what you mean by, you know, special and general. But, you know, I generally think

you know the same kinds of techniques

that we've been using to train the

current model — I expect that doubling down on those techniques, in the same way that we have for code, for models in general, for other kinds — you know, for image input, for

voice uh I expect those same techniques

will scale here as they have everywhere

else but this is giving sort of the

power of action to Claude And so you

could do a lot of really powerful things

but you could do a lot of damage also

yeah yeah no and we've been very aware

of that look my my view actually is

computer use isn't a fundamentally new

capability like the cbrn or autonomy

capabilities are um it's more like it

kind of opens the aperture for the model

to use and apply its existing abilities

uh and and so the way we think about it

going back to our RSP is nothing that

this model is

doing inherently increases

you know the risk from an RSP

perspective but as the models get more

powerful having this capability may make

it scarier once it you know once it has

the cognitive capability to um you know

to do something at the asl3 and asl4

level, this, you know, may be the thing that kind of unbinds it from doing so. So going forward, certainly this

modality of interaction is something we

have tested for and that we will

continue to test for going forward. I think it's probably better

to have to learn and explore this

capability before the model is super uh

you know super capable yeah and there's

uh a lot of interesting attacks like

prompt injection because now you've

widened the aperture so you can prompt

inject through stuff on screen so if

this becomes more and more useful then

there's more and more benefit to inject

inject stuff into the model if it goes

to certain web page it could be harmless

stuff like advertisements or it could be

like harmful stuff right yeah I mean we

thought a lot about, like, spam, CAPTCHA, you know, mass — there's, you know — like, one secret I'll tell you: if you've invented a new technology, not necessarily the biggest misuse, but the first misuse you'll see is scams. Just petty scams — people scamming each other. It's this thing as old as time. And it's just, every time,

you got to deal with it it's almost like

silly to say but it's it's true sort of

and spam in general is a thing as it

gets more and more intelligent it's uh

there are a lot of — like I said, there are a lot of petty criminals in the world, and, you know, it's like

every new technology is like a new way

for petty petty criminals to do

something you know something stupid and

malicious um is there any ideas about

sandboxing it like how difficult is the

sandboxing task yeah we sandbox during

training so for example during training

we didn't expose the model to the

internet um I think that's probably a

bad idea during training because uh you

know the model can be changing its

policy it can be changing what it's

doing and it's having an effect in the

real world um uh you know in in terms of

actually deploying the model right it

kind of depends on the application like

you know sometimes you want the model to

do something in the real world but of

course you can always put guard rails on the outside, right? You can say, okay, well, you know, this model is not going to move data — this model is not going to move any files from my computer or my web server to anywhere else.
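A toy sketch of that kind of outside guard rail — an allow/deny check applied to whatever actions the model proposes; the action format and blocked list here are illustrative placeholders, not a real product mechanism:

```python
# Hypothetical sketch of "guard rails on the outside": the model proposes actions,
# and a wrapper outside the model refuses anything that would move files off the
# machine. The action schema and blocked set are illustrative placeholders.

BLOCKED_ACTIONS = {"upload_file", "send_file", "network_write"}

def allowed(action: dict) -> bool:
    """Reject any proposed action that would exfiltrate local files."""
    return action.get("type") not in BLOCKED_ACTIONS

proposals = [
    {"type": "click", "x": 10, "y": 20},
    {"type": "upload_file", "path": "notes.txt"},
]
for proposed in proposals:
    verdict = "allowed" if allowed(proposed) else "blocked"
    print(proposed["type"], "->", verdict)
```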

Now, when you talk about sandboxing — again, when we get to ASL-4, none of these precautions

are going to make sense there right

where when you when you talk about asl4

you're then the model is being kind of

you know there's a a theoretical worry

the model could be smart enough to break

it to to kind of break out of any box

and so there we need to think about

mechanistic interpretability about you

know if we're if we're going to have a

sandbox, it would need to be mathematically provably sound. But, you

know that's that's a whole different

world than what we're dealing with with

the models

today yeah the science of building a box

from which an ASL-4 AI system cannot escape? I think it's probably not the right

approach I think the right approach

instead of having something you know

unaligned that that like you're trying

to prevent it from escaping I think it's

it's better to just design the model the

right way, or have a loop where you, you know, look inside the model and you're able to verify properties,

and that gives you an opportunity to

like iterate and actually get it right

I think containing bad models is a much worse solution than having good models.

let me ask about regulation what's the

role of regulation in keeping AI safe so

for example, you described the California AI regulation bill SB 1047 that was ultimately vetoed by the governor. What are the pros and cons of this bill? In general, yes, we ended up making some

suggestions to the bill and then some of

those were adopted and you know we felt I

think I think quite positively uh uh

quite positively about about the bill uh

by by the end of that um it did still

have some downsides um uh and you know

of course of course it got vetoed um I

think at a high level I think some of

the key ideas behind the bill um are you

know I would say similar to ideas behind

our rsps and I think it's very important

that some jurisdiction whether it's

California or the federal government

Andor other countries and other states

passes some regulation like this and I

can talk through why I think that's so

important so I feel good about our RSP

it's not perfect it needs to be iterated

on a lot but it's been a good forcing

function for getting the company to take

these risks seriously to put them into

product planning to really make them a

central part of work at anthropic and to

make sure that all the thousand people

and it's almost a thousand people now at

anthropic understand that this is one of

the highest priorities of the company if

not the highest priority uh

but, one, there are still some companies that don't have RSP-like mechanisms. OpenAI and Google did adopt these mechanisms a couple of months after Anthropic did, but

there are there are other companies out

there that don't have these mechanisms

at all uh and so if some companies adopt

these mechanisms and others don't uh

it's really going to create a situation

where you know some of these dangers

have the property that it doesn't matter

if three out of five of the companies

are being safe if the other two are are

being are being unsafe it creates this

negative externality and and I think the

lack of uniformity is not fair to those

of us who have put a lot of effort into

being very thoughtful about these

procedures the second thing is I don't

think you can trust these companies to

adhere to these voluntary plans in their

own right I like to think that anthropic

will we do everything we can that we

will our our our our RSP is checked by

our long-term benefit trust uh so you

know we do everything we can to to to

adhere to our own RSP um but you know

you hear lots of things about various

companies saying oh they said they would

do they said they would give this much

compute and they didn't they said they

would do this thing and they didn't um

you know I don't I don't think it makes

sense to you know to to to you know

litigate particular things that

companies have done but I I think this

this broad principle that like if

there's nothing watching over them

there's nothing watching over us as an

industry there's no guarantee that we'll

do the right thing and the stakes are

very high uh and so I think it's I think

it's important to have a uniform

standard that that that that that

everyone follows and to make sure that

simply that the industry does what a

majority of the industry has already

said is important and has already said

that they definitely will do right some

people uh you know I think there's there

a class of people who are against

regulation on principle I understand

where that comes from if you go to

Europe and you know you see something

like GDPR, you see some of the other stuff that they've done — you know, some of it's good, but

some of it is really unnecessarily

burdensome and I think it's fair to say

really has slowed innovation. And so I understand where people are coming from on priors; I understand why people start from that position.

but but again I think AI is different if

we go to the very serious risks of

autonomy and misuse that that that I

talked about you know just a just a few

minutes ago I think that those are

unusual and they warrant an unusually strong response. And so I think it's

very important again um we need

something that everyone can get behind

uh you know I think one of the issues

with

SB 1047, especially the original version

of it was it it had a bunch of the

structure of rsps but it also had a

bunch of stuff that was either clunky or

that that that just would have created a

bunch of burdens a bunch of Hassle and

might even have missed the Target in

terms of addressing the risks um you

don't really hear about it on Twitter

you just hear about kind of you know

people are people are cheering for any

regulation and then the folks who are

against make up these often quite

intellectually dishonest arguments about

how, you know, it'll make us move away from California — the bill doesn't apply if you're headquartered in California; the bill only applies if you do business in California — or that it would damage the open-source ecosystem, or that it would cause all of these things. I think

those were mostly nonsense but there are

better arguments against regulation

there's one guy, Dean Ball, who's really, you know, I think a very scholarly analyst, who looks at what

happens when a regulation is put in

place and ways that they can kind of get

a life of their own or how they can be

poorly designed and so our interest has

always been we do think there should be

regulation in this space but we want to

be an actor who makes sure that regulation is something that's

surgical that's targeted at the serious

risks and is something people can

actually comply with because something I

think The Advocates of Regulation don't

understand as well as they could is if

we get something in place that is um

that's poorly targeted that wastes a

bunch of people's time what's going to

happen is people are going to say see

these safety risks there you know this

is this is nonsense I just you know I

just had to hire 10 lawyers to to you

know to fill out all these forms I had

to run all these tests for something

that was clearly not dangerous and after

6 months of that, there will be a groundswell, and we'll end up with a durable consensus against regulation. And so

I I think the the worst enemy of those

who want real accountability is badly

designed regulation um we we need to

actually get it right uh and and this is

if there's one thing I could say to The

Advocates it it would be that I want

them to understand this Dynamic better

and we need to be really careful and we

need to talk to people who actually have

who actually have experience seeing how

regulations play out in practice and and

the people who have seen that understand

to be very careful if this was some

lesser issue I might be against

regulation at all but what what I want

the opponents to understand is is that

the underlying issues are actually

serious they're they're not they're not

something that I or the other companies

are just making up because of regulatory

capture they're not sci-fi fantasies

they're not they're not any of these

things um you know every every time we

have new model every few months we

measure the behavior of these models and

they're getting better and better at

these concerning tasks just as they are

getting better and better at um you know

good valuable economically useful tasks

and so I I I I would just love it if

some of the former you know I think

SB 1047 was very polarizing I would love

it if some of the most reasonable

opponents and some of the most

reasonable um uh proponents uh would sit

down together and you know I think I

think that you know the different the

different AI companies um you know

anthropic was the the only AI company

that you know felt positively in a very

detailed way I think Elon tweeted uh

tweeted briefly something positive but

you know some of the some of the big

ones like Google, OpenAI, Meta, Microsoft were pretty stridently against. So what I

would really like is if if you know some

of the key stakeholders some of the you

know thoughtful proponents and and some

of the most thoughtful opponents would

sit down and say how do we solve this

problem in in a way that the proponents

feel brings a real reduction in risk and

that the opponents feel that it is not

hampering the industry or hampering innovation any more than it needs to. And I think, for whatever reason, that

things got too polarized and those two

groups

didn't get to sit down in the way that

they should uh and and I feel I feel

urgency I really think we need to do

something in

2025 uh uh you know if we get to the end

of 2025 and we've still done nothing

about this then I'm going to be worried

I'm not I'm not worried yet because

again the risks aren't here yet but but

I I I think time is running short yeah

and come up with something surgical like

you said yeah yeah yeah exactly and and

we need to get we need to get away from

this this this intense

pro- safety versus intense

anti-regulatory rhetoric right it's

turned into these these flame Wars on

Twitter and nothing Good's going to come

with that so there's a lot of curiosity

about the different players in the game

one of the uh ogs is open AI you have

had several years of experience at open

AI what's your story and history there

yeah so I was at open AI for uh for

roughly five years uh for the last I

think it was a couple of years, you know, I was vice president of research there. Probably myself and Ilya Sutskever were the ones who, you know, really kind of set the research direction. Around 2016 or 2017 I first

started to really believe in or at least

confirm my belief in the scaling

hypothesis when when Ilia famously said

to me the thing you need to understand

about these models is they just want to

learn the models just want to learn um

and and and and again sometimes there

are these One S there these one

sentences these Zen cones that you hear

them and you're like ah that that

explains everything that explains like a

thousand things that I've seen and then

and then I I you know ever after I had

this visualization in my head of like

you optimize the models in the right way

you point the models in the right way

they just want to learn they just want

to solve the problem regardless of what

the problem is so get out of their way

basically get out of their way yeah

don't impose your own ideas about how

they should learn and you know this was

the same thing as Rich Sutton put out in

the bitter lesson or Gwern put out in the

scaling hypothesis you know I think

generally the dynamic was you know I got

this kind of inspiration from Ilya, from other folks like Alec Radford, who did the original GPT-1, and then ran really hard with it — me and my collaborators — on GPT-2, GPT-3, RL from human feedback, which was an

attempt to kind of deal with the early

safety and durability things like debate

and amplification heavy on

interpretability so again the

combination of safety plus scaling

probably 2018 2019 2020 those those were

those were kind of the years when myself

and my collaborators probably um you

know, many of whom became co-founders of Anthropic, kind of really had a vision and drove the direction. Why'd you leave? Why did you decide to leave? Yeah, so, look, I'm

gonna put things this way and I you know

I think it I think it ties to the to to

the race to the top right which is you

know, in my time at OpenAI, what I'd come to see was — as I'd come to appreciate the scaling hypothesis, and as I'd come to appreciate kind of the importance of safety along with the scaling hypothesis — the first one, I think, you know, OpenAI was getting on board

was was getting was getting on board

with um the second one in a way had

always been part of of open ai's

messaging um but uh you know over over

many years of of the time the time that

I spent there I think I had a particular

vision of how these how we should handle

these things how we should be brought

out in the world the kind of principles

that the organization should have and

look I mean there were like many many

discussions about, like, you know, should the company do this, should the company do that — like, there's

a bunch of misinformation out there

people say like we left because we

didn't like the deal with Microsoft

false although you know there was like a

lot of discussion a lot of questions

about exactly how we do the deal with

Microsoft um we left because we didn't

like commercialization that's not true

we built GPT-3 which was the model that

was commercialized I was involved in

commercialization. It's more, again, about how do you do it. Like, civilization

is going down this path to very powerful

AI what's the way to do it that is

cautious

straightforward honest um that build

trust in the organization and in

individuals how do we get from here to

there and how do we have a real vision

for how to get it right how can safety

not just be something we say because it

helps with recruiting um and you know I

think I think at the end of the day um

if you have a vision for that forget

about anyone else's Vision I don't want

to talk about anyone else's Vision if

you have a vision for how to do it you

should go off and you should do that

Vision it is incredibly unproductive to

try and argue with someone else's Vision

you might think they're not doing it the

right way you might think they're

they're they're dishonest who knows

maybe you're right maybe you're not um

uh but uh what what you should do is you

should take some people you trust and

you should go off together and you

should make your vision happen and if

your vision is compelling if you can

make it appeal to people some you know

some combination of ethically you know

in the market uh you know if if you can

if you can make a company that's a place

people want to join uh that you know

engages in practices that people think

are are reasonable while managing to

maintain its position in the ecosystem

at the same time if you do that people

will copy it um and the fact that you

were doing it especially the fact that

you're doing it better than they are um

causes them to change their behavior in

a much more compelling way than if

they're your boss and you're arguing

with them I just I don't know how to be

any more specific about it than that but

I think it's generally very unproductive

to try and get someone else's Vision to

look like your vision um it's much more

productive to go off and do a clean

experiment and say this is our vision

this is how this is this is how we're

going to do things your choice is you

can you can ignore us you can reject

what we're doing or you can you can

start to become more like us and

imitation is the sincerest form of

flattery um and you know that that that

plays out in the behavior of customers

that plays out in the behavior of the

public that plays out in the behavior of

where people choose to work uh and again

again at the end it's it's not about one

company winning or another company

winning if if we or another company are

engaging in some practice that you know

people people find genuinely appealing

and I want it to be in substance not

just not just in appearance um and you

know I think I think researchers are

sophisticated and they look at substance

uh and then other companies start

copying that practice and they win

because they copied that practice that's

great that's success that's like the

race to the top it doesn't matter who

wins in the end as long as everyone is

copying everyone else's good practices

right one way I think of it is like the

thing we're all afraid of is a race to the

bottom right in the race to the bottom

doesn't matter who wins because we all

lose right like you know in the most

extreme world we we make this autonomous

AI that you know the robots enslave us

or whatever right I mean that's half

joking but you know that that is the

most extreme uh uh thing thing that

could happen then then it doesn't matter

which company was ahead um if instead

you create a race to the top where

people are competing to engage in good

in good practices uh then you know at at

the end of the day you know it doesn't

matter who who ends up who ends up

winning doesn't even matter who who

started the race to the top the point

isn't to be virtuous the point is to get

the system into a better equilibrium

than it was before and and individual

companies can play some role in doing

this individual companies can can you

know can help to start it can help to

accelerate it and frankly I think

individuals at other companies have have

done this as well right the individuals

that when we put out an RSP react by

pushing harder to get something similar done at other companies. Sometimes other

companies do something that's like we're

like oh it's a good practice we think we

think that's good we should adopt it too

the only difference is you know I think

I think we are um we try to be more

forward leaning we try and adopt more of

these practices first and adopt them

more quickly when others when others

invent them but I think this Dynamic is

what we should be pointing at and that I

think I think it abstracts away the

question of you know which company's

winning who trusts who I I think all

these all these questions of drama are

are profoundly uninteresting and and the

the thing that matters is the ecosystem

that we all operate in and how to make

that ecosystem better because that

constrains all the players and so

anthropic is this kind of clean

experiment built on a foundation of like

what concretely AI safety should look

like we look I'm sure we've made plenty

of mistakes along the way the perfect

organization doesn't exist it has to

deal with the the imperfection of a

thousand employees; it has to deal with the imperfection of our leaders, including me; it has to deal with the imperfection of the people we've put, you know, to oversee the imperfection of the leaders, like

the like the board and the long-term

benefit trust it's it's all it's all a

set of imperfect people trying to aim

imperfectly at some ideal that will

never perfectly be achieved um that's

what you sign up for that's what it will

always be but uh uh imperfect doesn't

mean you just give up there's better and

there's worse and hopefully hopefully we

can begin to build we can do well enough

that we can begin to build some

practices that the whole industry

engages in and then you know my guess is

that M multiple of these companies will

be successful anthropic will be

successful these other companies like

ones I've been at the past will also be

successful and some will be more

successful than others that's less

important than again that we we align

the incentives of the industry and that

happens partly through the race to the

top partly through things like RSP

partly through again selected surgical

regulation you said Talent density beats

Talent

Mass so can you explain that can you

expand on it can you just talk about

what it takes to build a great team of

AI researchers and Engineers this is one

of these statements that's like more

true every every every month every month

I see this statement as more true than I

did the month before so if I were to do

a thought experiment let's say you have

a team of 100 people that are super

smart motivated and aligned with the

mission and that's your company or you

can have a team of a thousand people

where 200 people are super smart super

aligned with the mission and then uh

like and then like 800 people are let's

just say you pick 800 like random random

big Tech employees which would you

rather have right the talent mass is

greater in in the group of uh in the

group of a thousand people right you

have you have even even a larger number

of incredibly talented incredibly

aligned incredibly smart people um uh

but but the the issue is just that

if every time someone super talented

looks around they see someone else super

talented and super dedicated that sets

the tone for everything right that sets

the tone for everyone is super inspired

to work at the same place everyone

trusts everyone else if you have a

thousand or 10,000 uh people and and

things have really regressed right you

are not able to do selection and you're

choosing random people what happens is

then you need to put a lot of processes

and a lot of guard rails in place um

just because people don't fully trust

each other you have to adjudicate

political battles like there are so many

things that slow down the org's ability

to operate and so we're nearly a

thousand people and you know we've we've

we've tried to make it so that as large

a fraction of those thousand people as

possible are like super talented super

skilled it's one of the reasons we've

we've slowed down hiring a lot in the

last few months We Grew From 300 to 800

I believe I think in the first seven

eight months of the year and now we've

slowed down we're at like you know last

three months we went from 800 to 900 950

something like that don't quote me on

the exact numbers but I think there's an

inflection point around a thousand and

we want to be much more careful how how

we how we grow uh early on and and now

as well you know we've hired a lot of

physicists um you know theoretical

physicists can learn things really fast

um uh even even more recently as we've

continued to hire that you know we've

really had a high bar for on both the

research side and the software

engineering side have hired a lot of

senior people including folks who used

to be at other at other companies in

this space and we we've just continued

to be very selective it's very easy to

go go from 100 to a th000 a th000 to

10,000 without paying attention to

making sure everyone has a unified

purpose it's so powerful if your company

consists of a lot of different fiefdoms that

all want to do their own thing they're

all optimizing for their own thing um uh

it's very hard to get anything done but

if everyone sees the broader purpose of

the company if there's trust and there's

dedication to doing the right thing that

is a superpower that in itself I think

can overcome almost every other

disadvantage. And, you know — to Steve Jobs's point, 'A players want to look around and see other A players' — is another way of saying it. I don't know

what that is about human nature but it

is demotivating to see people who are

not obsessively driving towards a

singular Mission and it is on the flip

side of that super motivating to see

that it's interesting uh what's it take

to be a great AI researcher or engineer

from everything you've seen from working

with so many amazing people yeah um I

think the number one quality especially

on the research side but really both is

open-mindedness sounds easy to be

open-minded right you're just like oh

I'm open to anything um but you know if

I if I think about my own early history

in the scaling hypothesis um I was

seeing the same data others were seeing

I don't think I was like a better

programmer or better at coming up with

research ideas than any of the hundreds

of people that I worked with um in some

ways I was worse. You know, like, I've never — like, you know, precise programming, like finding the bug, writing the GPU kernels — I could point you to a hundred people here who are better at

that than I am um but but the the thing

that that that I think I did have that

was different was that I was just

willing to look at something with new

eyes right people said oh you know we

don't have the right algorithms yet we

haven't come up with the right the right

way to do things and I was just like uh

I don't know — like, you know, this neural net has like 30 million parameters, what if we gave it 50 million instead? Let's plot some graphs. That basic scientific mindset of: I see some variable that I could change, what happens when it changes — let's try these different things and create a graph.
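A toy version of that kind of experiment — sweep one variable (parameter count) and plot the result; the loss numbers below follow a made-up power law purely for illustration, not any real measurement:

```python
# Toy sketch of the "change one variable and plot it" habit: sweep parameter count
# and plot a hypothetical loss curve. The constants are invented for illustration;
# only the practice of plotting the sweep is the point.
import matplotlib.pyplot as plt

param_counts = [30e6, 50e6, 100e6, 300e6, 1e9]
losses = [3.5 * (n / 30e6) ** -0.07 for n in param_counts]  # illustrative power law

plt.plot(param_counts, losses, marker="o")
plt.xscale("log")
plt.xlabel("parameters")
plt.ylabel("loss (toy units)")
plt.title("toy scaling sweep")
plt.show()
```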

Even this was like the simplest thing in the world, right? Change the number of — you know, this wasn't like

PhD level experimental design this was

like this was like simple and stupid

like anyone could have done this if you

if you just told them that that that it

was important it's also not hard to

understand you didn't need to be

brilliant to come up with this um but

you put the two things together and you

know some tiny number of people some

single-digit number of people have driven forward the whole field by

realizing this uh and and it's you know

it's often like that if you look back at

the — you know, the discoveries in history, they're often like

that and so this this open-mindedness

and this willingness to see with new

eyes that often comes from being newer

to the field often experience is a

disadvantage for this that is the most

important thing it's very hard to look

for and test for but I think I think

it's the most important thing because

when you when you find something some

really new way of thinking thinking

about things when you have the

initiative to do that it's absolutely

transformative and also be able to do

kind of Rapid experimentation and in the

face of that be open-minded and curious

and looking at the data from just these

fresh eyes and see what is that actually

saying that applies in uh mechanistic

interpretability it's another example of

this like some of the early work in

mechanistic interpretability so simple

it's it's just no one thought to care

about this question before you said what

it takes to be a great AI researcher can

we rewind the clock back what what

advice would you give to people

interested in AI they're young looking

forward how can I make an impact on the

world I think my number one piece of

advice is to just start playing with the

models um this was actually I I I worry

a little this seems like obvious advice

now I think three years ago it wasn't

obvious and people started by oh let me

read the latest reinforcement learning

paper let me you know let me let me kind

of um no I mean that was really the that

was really the the and I mean you should

do that as well but uh now you know with

wider availability of models and apis

people are doing this more but I think I

think just experiential knowledge um

these models are new artifacts that no

one really understands um and so getting

experience playing with them I would

also say again in line with the like do

something new think in some new

Direction like there are all these

things that haven't been explored like

for example mechanistic interpretability

is still very new it's probably better

to work on that than it is to work on

new model architectures because it's you

know it's more popular than it was

before there are probably like a hundred

people working on it but there aren't

like 10,000 people working on it and

it's it's just this just this this

fertile area for study like like you

know, there's so much low-hanging fruit; you can just walk by and you can pick things. And

the only reason for whatever reason

people aren't people aren't interested

in it enough I think there are some

things around long long Horizon learning

and long Horizon tasks where there's a

lot to be done I think evaluations are

still we're still very early in our

ability to study evaluations

particularly for dynamic systems acting

in the world I think there's some stuff

around

multi-agent. 'Skate to where the puck is going' is my advice, and you don't

have to be brilliant to think of it like

all the things that are going to be

exciting in 5 years like in in people

even mention them as like you know

conventional wisdom but like it's it's

just somehow there's this barrier that

people don't people don't double down as

much as they could or they're afraid to

do something that's not the popular

thing I don't know why it happens but

like getting over that barrier is the

that's the my number one piece of advice

let's talk if we could a bit about

post-training. Yeah. So it seems that the modern post-training recipe has a little bit of everything: supervised fine-tuning, RLHF, the Constitutional AI with RLAIF — best acronym; it's again that naming thing — and then synthetic data. Seems

like a lot of synthetic data or at least

trying to figure out ways to have high

quality synthetic data. So, if this is the secret sauce that makes Anthropic's Claude so incredible, how much of the magic is in the pre-training and how much of it is in the post-training? Yeah, I mean, so first of

all we're not perfectly able to measure

that ourselves um uh you know when you

see some some great character ability

sometimes it's hard to tell whether it

came from pre-training or post-training

uh we developed ways to try and

distinguish between those two but

they're not perfect you know the second

thing I would say is you know it's when

there is an advantage and I think we've

been pretty good, in general, at RL, perhaps the best, although I don't know because I don't see what goes on inside other companies;

usually it isn't oh my God we have this

secret magic method that others don't

have right usually it's like well you

know we got better at the infrastructure

so we could run it for longer or you

know we were able to get higher quality

data or we were able to filter our data

better, or we were able to combine these methods in practice. It's usually some boring matter of practice and tradecraft. So

you know when I think about how to do

something special in terms of how we

train these models both pre-training but

even more so posttraining um you know I

I I really think of it a little more

again as like designing airplanes or

cars. It's not just, oh man, I have the blueprint; maybe that makes you make the next airplane, but there's some

cultural tradecraft of how we think

about the design process that I think is

more important than any particular gizmo we're able to invent. Okay, well, let me ask you about specific techniques. So first, on RLHF: what do you think, just zooming out, intuition, almost philosophy, why do you think RLHF works so well? If I go back

to like the scaling hypothesis one of

the ways to state the scaling hypothesis is: if you train for X and you throw enough compute at it, then you get X. And so RLHF is good at doing what

humans want the model to do or at least

to state it more precisely, doing what humans who look at the model for a brief period of time and consider different possible responses prefer as the

response uh which is not perfect from

both a safety and capabilities

perspective in that humans are are often

not able to perfectly identify what the

model wants and what humans want in the

moment may not be what they want in the

long term so there's there's a lot of

subtlety there but the models are good

at uh you know producing what the humans

in some shallow sense want uh and it

actually turns out that you don't even

have to throw that much compute at it

because of another thing which is this

this thing about a strong pre-trained

model being halfway to anywhere. Once you have the pre-trained model, you have all the representations you need to get the model where you want it to go.
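To make the preference step concrete, here is a minimal, hedged sketch of the pairwise preference learning at the core of RLHF; the RewardModel class, dimensions, and random data are illustrative stand-ins, not anyone's production code.

```python
# Minimal sketch of RLHF's preference-learning step: a reward model is
# trained so that responses humans preferred score higher than the ones
# they rejected (Bradley-Terry pairwise loss). Toy embeddings stand in
# for real model activations; nothing here is any lab's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a response embedding; higher means 'more preferred'."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy stand-ins for embeddings of (chosen, rejected) response pairs.
chosen, rejected = torch.randn(128, 64), torch.randn(128, 64)

for _ in range(100):
    # Push the chosen response's score above the rejected one's.
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The trained reward model then serves as the training signal for fine-tuning the policy, which is the "train for X and you get X" step described above.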

so do you think

RLHF makes the model smarter or just

appears smarter to the humans I don't

think it makes the model smarter I don't

think it just makes the model appear

smarter. It's like RLHF bridges the gap between the

human and the model right I could have

something really smart that like can't

communicate at all right we all know

people like this, people who are really smart but you can't understand what they're saying. So I think RLHF just bridges that gap. It's not the

only kind of RL we do it's not the only

kind of RL that will happen in the

future I think RL has the potential to

make models smarter to make them reason

better to make them operate better to

make them develop new skills even and

perhaps that could be done you know even

in some cases with human feedback but

the kind of RLHF we do today mostly

doesn't do that yet although we're very

quickly starting to be able to but it it

appears to sort of increase if you look

at the metric of helpfulness it

increases that. It also increases, what was this word in Leopold's essay, 'unhobbling', where basically the models are hobbled and then you do various trainings to them to unhobble them. I like that word because it's a rare word. So I think RLHF unhobbles the models in some ways, and then there are other ways where the model hasn't yet been unhobbled and needs to be unhobbled. If you

can say in terms of cost is pre-training

the most expensive thing or is

post-training creeping up to that? At the

present moment it is still the case that

uh pre-training is the majority of the

cost I don't know what to expect in the

future but I could certainly anticipate

a future where post-training is the

majority of the cost in that future you

anticipate would it be the humans or the

AI that's the costly thing for the post-training? I don't think you can scale up humans enough to get high quality. Any kind of method that relies on humans and uses a large amount of compute is going to have to rely on some scaled supervision method, like debate or iterated amplification or something like that. So on

that super interesting set of ideas around constitutional AI: can you describe what it is, as first detailed in the December 2022 paper, and beyond that, what is it? Yes, so this was from two years ago.

The basic idea is, so we described what RLHF is: you have a model, you just sample from it twice, it spits out two possible responses, and you ask a human, which response do you like better? Or another variant of it is, rate this response on a scale of 1 to 7. That's hard because you need to

scale up human interaction and uh it's

very implicit right I don't have a sense

of what I what I want the model to do I

just have a sense of like what this

average of a thousand humans wants the

model to do. So, two ideas. One is: could the AI system itself decide which response is better? Could you show the AI system these two responses and ask which response is better? And then second, well, what criterion should the AI use? And so then

there's this idea because you have a

single document a constitution if you

will that says these are the principles

the model should be using to to respond

and the AI system reads those principles, as well as reading the environment and the response, and it says, well, how good did the AI model do? It's basically a form of self-play; you're kind of training the model against itself. And so the AI gives the response, and then you feed that back into what's called the preference model, which in turn feeds the model to make it better. So you have this triangle of the AI, the preference model, and the improvement of the AI itself. And we should say that in

the Constitution the set of principles

are human-interpretable. Yeah, it's something both the human and the AI system can read, so it has this nice kind of translatability or symmetry. In practice we both use a model constitution and we use RLHF and we use some of these other methods, so it's turned into one tool in a toolkit that both reduces the need for RLHF and increases the value we get from each data point of RLHF. It also interacts in interesting ways with future reasoning-type RL methods. So it's one tool in the toolkit, but I think it is a very important tool.
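As a rough illustration of the triangle just described (the AI, the preference model, and the improving AI), here is a hedged sketch of the RLAIF labeling loop; generate and judge are hypothetical stand-ins for language-model calls, and the constitution strings are paraphrased for illustration, not quoted from the actual document.

```python
# Sketch of the constitutional-AI feedback loop: sample two responses,
# have the model itself (not a human) pick the one that better follows
# a written principle, and collect the result as preference data.
# `generate` and `judge` are hypothetical stand-ins for model calls.
import random

CONSTITUTION = [  # illustrative paraphrases, not the real text
    "Choose the response that is more helpful and honest.",
    "Choose the response less likely to assist harmful activity.",
]

def generate(prompt: str) -> str:
    """Hypothetical: sample one response from the model being trained."""
    return f"response-{random.randint(0, 999)}"

def judge(prompt: str, a: str, b: str, principle: str) -> str:
    """Hypothetical: the AI judges which response better satisfies the
    principle; this call replaces the human rater used in plain RLHF."""
    return random.choice([a, b])  # placeholder for a real model call

def label_preferences(prompts):
    """Build (prompt, chosen, rejected) triples for the preference model."""
    data = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)  # sample twice
        principle = random.choice(CONSTITUTION)
        chosen = judge(prompt, a, b, principle)
        data.append((prompt, chosen, b if chosen == a else a))
    return data  # these triples train the preference model, which in
                 # turn is used to improve the policy, closing the triangle
```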

Well, it's a compelling one to us humans, thinking about

the founding fathers and the founding of

the United

States, the natural question is: who, and how, do you think gets to define the constitution, the set of principles in the constitution? Yeah, so I'll give

like a practical um answer and a more

abstract answer I think the Practical

answer is like look in practice models

get used by all kinds of different like

customers right and and so uh you can

have this idea where you know the model

can can have specialized rules or

principles you know we fine-tune

versions of models implicitly we've

talked about doing it explicitly having

having special principles that people

can can build into the models um uh so

from a practical perspective the answer

can be very different for different people: a customer service agent behaves very

differently from a lawyer and obeys

different principles um but I think at

the base of it there are specific

principles that the models uh you know

have to obey I think a lot of them are

things that people would agree with

Everyone agrees that we don't want models to present these CBRN (chemical, biological, radiological, nuclear) risks. I think we can go a

little further and agree with some basic

principles of democracy and the rule of

law beyond that it gets you know very

uncertain and and there our goal is

generally for the models to be more

neutral to not espouse a particular

point of view and you know more just be

kind of like wise uh agents or advisers

that will help you think things through

and will you know present present

possible considerations but you know

don't express strong, specific opinions. OpenAI released a model spec

where it kind of clearly concretely

defines some of the goals of the model

and specific examples of how the model should behave. Do you find that interesting? By the way, I should mention that, I believe, the brilliant John Schulman was a part of that; he's now at Anthropic. Do you think this is a useful direction? Might Anthropic release a model spec as well? Yeah, so I think

that's a pretty useful direction again

it has a lot in common with uh

constitutional AI so again another

example of like a race to the top right

we have something that's like we think

you know a better and more responsible

way of doing things um it's also a

competitive advantage um then uh others

kind of you know discover that it has

advantages and then start to do that

thing uh we then no longer have the

competitive Advantage but it's good from

the perspective that now everyone has

adopted a positive practice that others

were not adopting. And so our response to that is, well, looks like we need a new

competitive advantage in order to keep

driving this race upwards um so that's

that's how I generally feel about that I

also think every implementation of these

things is different so you know there

were some things in the model spec that

were not in Constitutional AI, and so we can always adopt those things, or at

least learn from them um so again I

think this is an example of the positive dynamic that I think we should all want the field to have. Let's talk about the incredible essay Machines of Loving Grace. I recommend everybody read it. It's

a long one it is rather long yeah it's

really refreshing to read concrete ideas

about what a positive future looks like

and you took sort of a bold stance

because like it's very possible you

might be wrong on the dates or specific

applications. Yeah, I'm fully expecting to definitely be wrong about all the details. I might be just spectacularly wrong about the whole thing, and people will laugh at me for years. That's just how the future works. So you provided a bunch of

concrete positive impacts of AI and how

you know exactly a super intelligent AI

might accelerate the rate of

breakthroughs in for example biology and

chemistry that would then lead to things

like we cure most cancers prevent all

infectious disease double the human

lifespan and so on so let's talk about

this essay first can you give a high

level vision of this essay and um what

key takeaways that people should have

yeah I have spent a lot of time and

anthropic has spent a lot of effort on

like you know how do we address the

risks of AI right how do we think about

those risks like we're trying to do a

race to the top you know that requires

us to build all these capabilities, and the capabilities are cool, but a big part of what we're trying to do is address the risks. And the justification for that is, well, all these positive things: the market is this very healthy

organism right it's going to produce all

the positive things the risks I don't

know we might mitigate them we might not

and so we can have more impact by trying

to mitigate the risks but I noticed that

one flaw in that way of thinking, and it's, if not a change in how seriously I take the risks, maybe a change in how I talk about them: no matter how logical or rational that line of reasoning I just gave might be, if you only talk about risks, your brain only thinks about risks. And so I think

it's actually very important to

understand what if things do go well and

the whole reason we're trying to prevent

these risks is not because we're afraid

of Technology not because we want to

slow it down. It's because if we can get to the other side of these risks, if we can run the gauntlet successfully, to put it in stark terms, then on

the other side of the gauntlet are all

these great things and these things are

worth fighting for and these things can

really inspire people. And I imagine, because, look, you have all these investors, all these VCs, all these AI companies talking about all the positive benefits of AI, but as you point out, it's weird: there's actually a dearth of really getting specific about it.

there's a lot of like random people on

Twitter posting these kinds of gleaming cities and this kind of vibe of grind, accelerate harder, kick out the decels; it's just this very aggressive, ideological thing. But then you're like, well, what are you actually excited about?

um and so and so I figured that you know

I think it would be interesting and

valuable for someone who's actually

coming from the risk side to try and really make an attempt at explaining what the benefits are, both because I think

it's something we can all get behind and

I want people to understand I want them

to really understand that this isn't doomers versus accelerationists. It's that if you have a true understanding of where things are going with AI, and maybe that's the more important axis, AI is moving fast versus AI is not

moving fast then you really appreciate

the benefits, and you really want humanity, our civilization, to

seize those benefits but you also get

very serious about anything that could

derail them so I think the starting

point is to talk about what this

powerful AI which is the term you like

to use uh most of the world uses AGI but

you don't like the term because it basically has too much baggage, has become meaningless. It's like, we're stuck with the terms; maybe we're stuck with the terms and my efforts to change them are futile. I'll tell you what else: this is a pointless semantic point, but I keep talking about it in public, so I'm just going to do it once more. I think it's a little like, let's say it was 1995 and Moore's law

is making the computers faster and like

for some reason there had been this verbal tic where everyone was like, well, someday we're going to have supercomputers,

and like supercomputers are going to be

able to do all these things that like

once we have supercomputers we'll be able to sequence the genome,

and we'll be able to do other things and

so and so like one it's true the

computers are getting faster and as they

get faster they're going to be able to

do all these great things, but there's no discrete point at which you had a supercomputer and previous computers were not. Supercomputer is a term we use, but it's a vague term to just describe computers that are faster than what we have today.

um there's no point at which you pass a

threshold and you're like oh my God

we're doing a totally new type of computation. And so I feel

that way about AGI like there's just a

smooth exponential and like if if by AGI

you mean like like AI is getting better

and better and like gradually it's going

to do more and more of what humans do

until it's going to be smarter than

humans and then it's going to get

smarter even from there, then yes, I believe in AGI. But if AGI is some discrete or separate thing, which is the way people often talk about it, then it's kind of a meaningless buzzword. Yeah. To me it's just sort of a platonic form of a powerful AI, exactly how you

define it I mean you define it very

nicely so on the intelligence axis it's

just on pure intelligence it's smarter

than a Nobel Prize winner as you

describe across most relevant

disciplines so okay that's just

intelligence so it's uh both in

creativity and be able to generate new

ideas all that kind of stuff in every

discipline, a Nobel Prize winner in their prime. Okay. It can use every modality;

that's kind of self-explanatory but just

operate across all the modalities of the

world uh it can go off for many hours

days and weeks to do tasks and do its

own sort of detailed planning and only

ask you help when it's needed uh it can

use this is actually kind of interesting

I think in the essay you said I mean

again it's a bet that it's not going to

be embodied but it can control embodied

tools so it can control tools robots

laboratory equipment. The resources used to train it can then be repurposed to

run millions of copies of it and each of

those copies would be independent that

can do their own independent work so you

can do the cloning of the intelligence

system yeah yeah I mean you you might

imagine from outside the field that like

there's only one of these right that

like you made it you've only made one

but the truth is that like the scale up

is very quick like we we do this today

we make a model and then we deploy

thousands maybe tens of thousands of

instances of it I think by the time you

know certainly within 2 to 3 years

whether we have these super powerful AIs or not, clusters are going to get to the size where you'll be able to

deploy millions of these and they'll be

you know faster than humans and so if

your picture is oh we'll have one and

it'll take a while to make them my point

there was no actually you have millions

of them right away and in general they

can learn and

act uh 10 to 100 times faster than

humans so that's a really nice

definition of powerful AI okay so that

but you also write that clearly such an

entity would be capable of solving

very difficult problems very fast but it

is not trivial to figure out how fast

two extreme positions both seem false to

me. So the singularity is on the one extreme, and the opposite on the other extreme. Can you describe each of the extremes? Yeah, so let's describe the extremes. One

extreme would be well look um you know

uh if we look at kind of evolutionary

history like there was this big

acceleration where you know for hundreds

of thousands of years we just had like

you know single cell organisms and then

we had mammals and then we had apes and

then that quickly turned to humans

humans quickly built industrial

civilization and so this is going to

keep speeding up and there's no ceiling

at the human level once models get much

much smarter than humans they'll get

really good at building the next models

and if you write down a simple differential equation, this is an exponential.
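For readers who want the caricature spelled out, the "simple differential equation" version of the argument, under the strong assumption that capability grows in proportion to itself, looks like this:

```latex
% Recursive self-improvement modeled as a first-order ODE (this is the
% argument being caricatured, not a claim about real systems):
\frac{dC}{dt} = k\,C \quad\Longrightarrow\quad C(t) = C_0\, e^{k t}
% Stronger variants assume \frac{dC}{dt} = k\,C^2, whose solution
% C(t) = C_0 / (1 - k C_0 t) blows up in finite time at t = 1/(k C_0);
% that finite-time divergence is the "singularity" intuition.
```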

So what's going to happen is that models will

build faster models models will build

faster models and those models will

build nanotech that can take

over the world and produce much more

energy than you could produce otherwise

And so if you just solve this abstract differential equation, then five days after we build the first AI that's more powerful than humans, the world will be filled with these AIs, and every possible technology that could be invented will be invented. I'm caricaturing this a little bit, but I think

that's one extreme and the reason that I

think that's not the case is that, one, I think it just neglects the laws of physics. It's only possible to do things so fast in the physical world; some of those loops go through producing faster hardware, and it takes a long time to produce faster hardware. Things take a long time. There's

this issue of complexity like I think no

matter how smart you are, people talk about, oh, we can make models of biological systems; they'll do everything the biological systems do. Look, I think computational modeling can do a lot; I did a lot of computational modeling when I worked in biology. But

there are just a lot of things you can't predict; they're complex enough that just running the experiment is going to beat any modeling,

no matter how smart the system doing the

modeling is oh even if it's not

interacting with the physical world just

the modeling is going to be hard yeah I

think the modeling is going to be hard, and getting the model to match the physical world is going to be hard too. All right, so it does have to interact with the physical world to verify. But it's just,

you know you just look at even the

simplest problems. I think I talk about the three-body problem, or simple chaotic prediction, or predicting the economy. It's really hard

to predict the economy two years out

Maybe normal humans can predict what's going to happen in the economy in the next quarter, although they can't really do that; maybe an AI system that's a zillion times smarter can only predict it out a year or something. You have this kind of exponential increase in computer intelligence for a linear increase in ability to predict. Same with, again,

biological molecules interacting: you don't know what's going to happen when you perturb a complex system. You can find simple parts in it; if you're

smarter you're better at finding these

simple parts. And then I think human institutions are just really difficult. It's been hard to get people, I

won't give specific examples but it's

been hard to get people to adopt even

the technologies that we've developed

even ones where the case for their

efficacy is very, very strong. People have concerns; they think things are conspiracy theories. It's just been very difficult. It's also been very difficult to get very simple things through the

regulatory system. And I don't want to disparage anyone who works in regulatory systems of any technology; there are hard trade-offs

they have to deal with they have to save

lives but but the system as a whole I

think makes some obvious tradeoffs that

are very far from maximizing human

welfare. And so if we bring AI systems into these human systems, often the

level of intelligence may just not be

the limiting factor. It just may be that it takes a long time to do something. Now, if the AI system circumvented all governments, if it just said, I'm dictator of the world and I'm going to do whatever, some of these things it could do. But again, for the things having to do with complexity, I still

think a lot of things would take a while

I don't think it helps that the AI systems can produce a lot of energy or go to the moon. Some people, in comments responding to the essay, said the AI system can produce a lot of energy and smarter AI systems; that's missing the point. That kind of cycle doesn't solve the key problems I'm talking about here. So I think a bunch of people missed the point there.

But even if it were completely unaligned and could get around

all these human obstacles it would have

trouble but again if you want this to be

an AI system that doesn't take over the

world, that doesn't destroy humanity, then basically it's going to need to follow basic human laws,

right? If we want to have an actually good world, we're going to have to have an AI system that interacts with humans, not one that

kind of creates its own legal system or

disregards all the laws or all of that

so as inefficient as these processes are

you know we're going to have to deal

with them because there there needs to

be some popular and Democratic

legitimacy in how these systems are

rolled out we can't have a small group

of people who are developing these

systems say this is what's best for

everyone right I think it's wrong and I

think in practice is not going to work

anyway. So you put all those things together, and we're not going to change the world and upload everyone in five minutes. A, I don't think it's going to happen, and B, to the extent that it could happen, it's not the way to lead to a good world. So

that's on one side on the other side

there's another set of perspectives

which I have actually in some ways more

sympathy for which is look we've seen

big productivity increases before right

you know economists are familiar with

studying the productivity increases that

came from the computer Revolution and

internet Revolution and generally those

productivity increases were

underwhelming; they were less than you might imagine. There was a quote from Robert Solow: you see the

computer Revolution everywhere except

the productivity statistics so why is

this the case people point to the

structure of firms, the structure of enterprises, how slow

it's been to roll out our existing

technology to very poor parts of the

world which I talk about in the essay

right how do we get these Technologies

to the poorest parts of the world that

are behind on cell phone technology

computers, medicine, let alone newfangled AI that hasn't been invented yet. So you could have a perspective that's like, well, this is amazing technically, but it's all a nothing burger.

I think Tyler Cowen, who wrote a response to my essay, has that perspective. I think he

thinks the radical change will happen

eventually but he thinks it'll take 50

or 100 years and and you could have even

more static perspectives on the whole

thing. I think there's some truth to it; I just think the timescale is too long. And I can

actually see both sides with today's AI

so uh you know a lot of our customers

are large Enterprises who are used to

doing things a certain way um I've also

seen it in talking to governments; those are prototypical institutions, entities that are slow to change. But the dynamic I see over

and over again is yes it takes a long

time to move the ship yes there's a lot

of resistance and lack of understanding

but the thing that makes me feel that

progress will in the end happen

moderately fast not incredibly fast but

moderately fast is this: what I find, over and over again, in large companies, even in governments, which have been actually

surprisingly forward leaning uh you find

two things that move things forward one

you find a small fraction of people

within a company within a government who

really see the big picture who see the

whole scaling hypothesis who understand

where AI is going or at least understand

where it's going within their industry

and there are a few people like that

within the current within the current US

government who really see the whole

picture and and those people see that

this is the most important thing in the

world, and they agitate for it. But they alone are not enough to

succeed because they are a small set of

people within a large organization

but as the technology starts to roll out

as it succeeds in some places in the

folks who are most willing to adopt it

the Spectre of competition gives them a

wind at their backs because they can

point within their large organization

they can say look these other guys are

doing this, right? One bank can say, look, this newfangled hedge fund is doing this thing; they're going to eat our lunch. In the US we can say, we're afraid China's going to get there before we are. And that combination,

the specter of competition plus a few visionaries within these organizations that in many ways are sclerotic: you put those two

things together and it actually makes

something happen I mean it's interesting

it's a balanced fight between the two

because inertia is very powerful but but

but eventually over enough time the

Innovative approach breaks through um

and I've seen that happen I've seen the

Arc of that over and over again and it's

like: the barriers are there, the barriers to progress, the complexity, not knowing how to use the model or how to deploy it, and for a bit

it seems like they're going to last

forever like change doesn't happen but

then eventually change happens and

always comes from a few people I felt

the same way when I was an advocate of

the scaling hypothesis within the AI

field itself and others didn't get it it

felt like no one would ever get it; then it felt like we had a secret almost no one had, and then a

couple years later everyone has the

secret and so I think that's how it's

going to go with the deployment of AI in the world: the barriers are going to fall apart gradually, and then all at once. And so I think, and this is just an instinct, I could easily see how I'm wrong, I think it's going to be more like five or ten years, as I say in the essay, than it's going to be 50 or 100 years. I also think it's going to be five or ten years more than it's going to be five or ten hours, because I've

just seen how human systems work and I

think a lot of these people who write

down the differential equations who say

AI is going to make more powerful AI who

can't understand how it could possibly

be the case that these things won't change so fast, I think they don't

understand these things. So what do you use as the timeline to where we achieve AGI, aka powerful AI, aka super useful AI? I'm going to start calling it that. It's a debate about naming. On pure intelligence: smarter than a Nobel Prize winner in every relevant discipline, and all the things we've said; modality; it can go and do stuff on its own for days, weeks, and do biology experiments on its own. You know what, let's just

stick to biology, because you sold me on the whole biology and health section. That's so exciting. I was getting giddy from a scientific perspective; it made me want to be a biologist. That was the feeling I had when I was writing it, that it's

like this would be such a beautiful

future if we can just make it happen, if we can just get the landmines out of the way and make it happen. There's so much beauty and elegance and moral force behind it if we can just do it. And it's something we should

all be able to agree on right like as

much as we fight about all these political questions, this is something that could actually bring us together. But you were asking: when will we get this? When do you think? Let's just put numbers on it. So,

this is of course the thing I've been

grappling with for many years and I'm

not at all confident. Every time I say 2026 or 2027, there will be a zillion people on Twitter who will be like, AI CEO said 2026, 2027, and it'll be repeated for the next two years that this is definitely when I think it's going to happen. So whoever's excerpting these clips will crop out the thing I just said and only say the thing I'm about to say. But I'll just say it anyway. So, if you

extrapolate the curves that we've had so

far right if if you say well I don't

know we're starting to get to like PhD

level, and last year we were at undergraduate level, and the year before

we were at like the level of a high

school student again you can you can

quibble with at what tasks and for what

we're still missing modalities but those

are being added: computer use was added, image input was added, image generation has been added. And this is totally unscientific, but if you just kind of

like eyeball the rate at which these

capabilities are increasing it does make

you think that we'll get there by 2026

or 2027 again lots of things could

derail it we could run out of data you

know we might not be able to scale

clusters as much as we want like you

know maybe Taiwan gets blown up or

something and you know then we can't

produce as many gpus as we want so there

there are all kinds of things that could

could derail the whole process so I

don't fully believe the straight line

extrapolation but if you believe the

straight line extrapolation you'll you

we'll get there in 2026 or 2027 I think

the most likely is that there's some

mild delay relative to that um

I don't know what that delay is but I

think it could happen on schedule I

think there could be a mild delay I

think there are still worlds where it doesn't happen in a hundred years; the number of those worlds is rapidly decreasing. We are rapidly running out of truly convincing blockers, truly compelling reasons why this will

not happen in the next few years there

were a lot more in 2020, although my guess, my hunch, at that time was that we would make it through all those blockers. So, sitting as someone who has seen most of the blockers cleared out of the way, my suspicion is that the rest of them will not block us. But look, at the end of the day, I

don't want to represent this as a

scientific prediction. People call them scaling laws; that's a misnomer, like Moore's law is a misnomer. Moore's law, scaling laws, they're not laws of the universe; they're empirical regularities.

I am going to bet in favor of them

continuing but I'm not certain of that

so you extensively describe sort of the

compressed 21st century how AGI will

help

uh set forth a chain of breakthroughs in

biology and medicine that help us in all

these kinds of ways that I mentioned so

what do you think are the early steps it might take? And by the way, I asked Claude for good questions to ask you, and Claude told me to ask: what does a typical day for a biologist working with AGI look like in this future? Yeah, Claude is

curious. Let me start with your first question and then I'll answer that. Claude wants to know what's in his future, right? Exactly. Who am I going to be working with? Exactly. So I think one of the

things I went hard on in the essay, let me go back to this idea, because it's really had an impact on me:

this idea that within large

organizations and systems there end up

being a few people or a few new ideas

that kind of cause things to go in a different direction than they would have before, that disproportionately affect the trajectory. There's a bunch of

the same thing going on right if you

think about the health world there's

like you know trillions of dollars to

pay out Medicare and other health insurance, and then the NIH is $100 billion. And then if I think of the few things that have really revolutionized anything, they could be encapsulated in a small fraction of that. And so when I think of

where will AI have an impact I'm like

can AI turn that small fraction into a

much larger fraction and raise its

quality. And my experience within biology is that the biggest problem of biology is that you can't see what's going on. You have very little ability to see what's going on, and even less ability to change it. What you have is this: from this you have to infer that there's a bunch of

cells, that within each cell is three billion base pairs of DNA built according to a genetic code, and that there are all these processes just going on without any ability of us, as unaugmented humans, to affect them. These

cells are dividing most of the time

that's healthy but sometimes that

process goes wrong and that's cancer um

the cells are aging; your skin may change color, develop wrinkles as you age. And all of this is determined by

these processes all these proteins being

produced transported to various parts of

the cells, binding to each other. And in our initial state of knowledge about biology, we

didn't even know that these cells

existed we had to invent microscopes to

observe the cells. We had to invent more powerful microscopes to see below the level of the cell, to the level of molecules. We had to invent X-ray crystallography to see the DNA. We had to invent gene sequencing to read the DNA. We had to invent protein folding technology to predict how proteins would fold and how these things bind to each other. We had to invent various techniques; now we can edit the DNA, with CRISPR, as of the last 12 years. So the whole history of

biology, a whole big part of the history, is basically our ability to read and understand what's going on,

and our ability to reach in and

selectively change things. And my view is that there's so much more we can still do there. You can do CRISPR, but can you do it for your whole body? Let's say I want to do it for one particular type of cell, and I want the rate of targeting the wrong cell to be very low; that's still a challenge, that's still something people are working on. That's what we might need for gene

therapy for certain diseases and so the

reason I'm saying all of this, and it goes beyond this to gene sequencing, to new types of nanomaterials for observing what's going on inside cells, to antibody-drug conjugates, the reason I'm saying all this is that this could be a

leverage point for the AI systems, right? The number of such inventions is in the mid double digits or something, mid double digits, maybe low triple digits, over the

history of biology let's say I have a

million of these AIs; working together, can they discover thousands of these very quickly? And does that

provide a huge lever instead of trying

to leverage the two trillion a year we spend on Medicare or whatever, can we leverage the 1 billion a year that's spent to discover, but with much higher quality?

And so, what is it like being a scientist that works with an AI system? The way I think about it is, I think in the early stages the AIs are

going to be like grad students you're

going to give them a project you're

going to say, I'm the experienced biologist, I've set up the lab; the biology professor, or even the grad students themselves, will say, here's what you can do with an AI system: I'd like to study this. And

the AI system has all the tools: it can look up all the literature to decide what to do, it can look at all the equipment, it can go to a website and say, hey, I'm going to go to Thermo Fisher or whatever the dominant lab equipment company is today (in my time it was Thermo Fisher).

I'm going to order this new equipment to

do this. I'm going to run my

experiments I'm going to you know write

up a report about my experiments I'm

going to you know inspect the images for

contamination I'm going to decide what

the next experiment is I'm going to like

write some code and run a statistical

analysis all the things a grad student

would do there will be a computer with

an AI that like the professor talks to

every once in a while and it says this

is what you're going to do today the AI

system comes to it with questions um

when it's necessary to run the lab

equipment it may be limited in some ways

may have to hire a human lab assistant

to you know to do the experiment and

explain how to do it. Or it could use advances in lab automation that have been developed over the last decade or so and will continue to be developed. And so it'll look like

there's a human professor and a thousand

AI grad students. And if you go to one of these Nobel Prize-winning biologists, you'll say, okay, well, you had 50 grad students; well, now you have a thousand, and they're smarter than you are, by the way. Then I think at

some point it'll flip around, where the AI systems will be the PIs, will be the leaders, and they'll be ordering humans or other AI systems around. So I think that's how it'll work

on the research side. And they would be the inventors of a CRISPR-type technology? They would be the inventors of a CRISPR-type technology. And then I

think, as I say in the essay, we'll want to, turning loose is probably the wrong term, but we'll want to harness the AI systems to improve the clinical trial system

as well there's some amount of this

that's regulatory that's a matter of

societal decisions and that'll be harder

but can we get better at predicting the

results of clinical trials can we get

better at statistical design, so that clinical trials that used to require 5,000 people, and therefore needed $100 million and a year to enroll them, now need 500 people and two months to enroll them? That's where we should start. And can we increase the success

rate of clinical trials by doing things

in animal trials that we used to do in

clinical trials and doing things in

simulations that we used to do in animal

trials? Again, we won't be able to simulate it all; AI is not God. But can we shift the curve substantially and radically? I don't know; that would be my picture. Doing it in vitro, I mean, you're still slowed down, it still takes time, but you can do it much, much faster.

Yeah, yeah. Can we just, one step at a time, can that add up to a lot of steps? Even though we

still need clinical trials even though

we still need laws even though the FDA

and other organizations will still not

be perfect can we just move everything

in a positive direction? And when you add up all those positive directions, do you

get everything that was going to happen

from here to 2100 instead happens from

2027 to 2032 or something another way

that I think the world might be changing

with AI

even today, but moving towards this future of the powerful, super useful AI, is programming. So how do you see

the nature of programming because it's

so intimate to the actual Act of

building AI how do you see that changing

for us humans I think that's going to be

one of the areas that changes fastest um

for two reasons one programming is a

skill that's very close to the actual

building of the AI. So the farther a skill is from the people who are

building the AI the longer it's going to

take to get disrupted by the AI right

like I truly believe that like AI will

disrupt agriculture maybe it already has

in some ways but that's just very

distant from the folks who are building

Ai and so I think it's going to take

longer. But programming is the bread and butter of a large fraction of the employees who work at Anthropic and at the other companies, and so it's

going to happen fast the other reason

it's going to happen fast is with

programming you close the loop, both when you're training the model and when you're applying the model. The idea that the model can write the code means that the model can then run the code, see the results, and interpret them back. And so, unlike hardware, unlike biology, which we just discussed, the model has the ability to close the loop.
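As a concrete illustration of that closed loop, here is a minimal, hedged sketch of a generate-run-retry harness; ask_model is a hypothetical stand-in for any code-generation call, not a specific product's API.

```python
# Minimal sketch of the "closed loop" for code: the model writes a
# program, the harness runs it, and any failure is fed back as context
# for the next attempt. `ask_model` is a hypothetical placeholder.
import subprocess
import sys
import tempfile

def ask_model(prompt: str) -> str:
    """Hypothetical model call; returns a candidate Python program."""
    return "print(sum(range(10)))"  # placeholder output

def run(code: str) -> tuple[bool, str]:
    """Execute the candidate program and capture its output or error."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stdout + proc.stderr

def closed_loop(task: str, max_tries: int = 3) -> str:
    prompt = task
    code = ""
    for _ in range(max_tries):
        code = ask_model(prompt)
        ok, output = run(code)
        if ok:
            return code  # loop closed: generated, executed, verified
        # Feed the failure back so the next attempt can interpret it.
        prompt = f"{task}\nPrevious attempt failed with:\n{output}"
    return code
```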

So I think those two things are going to lead to the model getting good at programming

very fast. As we saw, on typical real-world programming tasks, models have gone from 3% in January of this year to 50% in October of this year. So we're on that S-curve, where it's going to start slowing down soon because you can only get to 100%, but I would guess that in another ten months we'll probably get pretty close; we'll be at at least 90%.
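As a back-of-the-envelope check on that guess, fitting a logistic S-curve to the two quoted data points (assuming January is month 0 and October is month 10) lands in the same neighborhood; this is purely illustrative arithmetic, in the same unscientific spirit as the eyeballing described earlier.

```python
# Fit P(t) = 1 / (1 + exp(-k (t - m))) to the two data points quoted
# above: 3% in January (month 0) and 50% in October (month 10).
import math

p0, t0 = 0.03, 0.0   # 3% in January
m = 10.0             # logistic midpoint: P = 50% at month 10

# Solve P(t0) = p0 for the growth rate k.
k = math.log(1.0 / p0 - 1.0) / (m - t0)

def score(t_months: float) -> float:
    return 1.0 / (1.0 + math.exp(-k * (t_months - m)))

print(f"k = {k:.3f} per month")
print(f"projection 10 months past October: {score(20.0):.0%}")  # ~97%
```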

So again, I don't know how long it'll take, but I would guess again 2026, 2027. Twitter people who crop out these numbers and get rid of the caveats, I don't like you, go away. I would guess that the kind of

task that the vast majority of coders do

AI can probably do. If we make the task very narrow, like, just write code, AI systems will be able to do that. Now, that said, I think comparative advantage is powerful. We'll find that when AIs can do 80% of a coder's job, including most of it that's literally, write code with a given spec, we'll find that the remaining parts

of the job become more leveraged for

humans. Humans will be more about high-level system design, or looking at the app and, is it architected well, and the design and UX aspects. And eventually AI will be

able to do those as well right that's my

vision of the you know powerful AI

system but I think for much longer than

we might expect we will see that

uh small parts of the job that humans

still do will expand to fill their

entire job in order for the overall

productivity to go up um that's

something we've seen before. It used to be that writing and editing letters was very difficult, and getting things into print was difficult; well, as soon as you had word processors and then computers, and it became easy to produce work and easy to share it, that became instant, and all the focus was on the ideas. So this logic of

comparative advantage that expands tiny

parts of the tasks to large parts of the

tasks and creates new tasks in order to

expand productivity I think that's going

to be the case again someday AI will be

better at everything, and that logic won't apply, and then humanity will have to think about how to collectively deal with that,

and we're thinking about that every day

um and you know that's another one of

the grand problems to deal with aside

from misuse and autonomy and you know we

should take it very seriously but I

think in the near term, and maybe even in the medium term, like two, three, four years, I expect that humans will continue to have a huge role, and the nature of programming will change, but programming as a role, programming as a job, will not change; it'll just be less writing things line by line, and it'll be more macroscopic. And I wonder what the future of IDEs looks like, so

the tooling of interacting with AI

systems this is true for programming and

also probably true in other contexts like computer use, but maybe domain-specific; like we mentioned biology, it probably needs its own tooling for how to be effective, and then programming needs its own tooling. Is Anthropic going to play in that space of tooling as well, potentially? I'm absolutely convinced that powerful IDEs, that there's so much low-hanging fruit to be grabbed there. Right now it's just, you talk to the model and it talks back.

But look, IDEs are great at lots of kinds of static analysis; so much is possible with static analysis, like many bugs you can find without even running the code. Then IDEs are good for running particular things, organizing your code, measuring coverage of unit tests; there's so much that's been possible with a normal IDE.
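As a small illustration of the kind of static check meant here, the sketch below parses Python source and flags a variable that is assigned but never read, without executing anything; it is a toy example, not any particular IDE's implementation.

```python
# Toy static-analysis pass: walk the AST and report "dead stores",
# names that are assigned but never read. No code is executed.
import ast

SOURCE = """
def area(w, h):
    unused = 42
    return w * h
"""

tree = ast.parse(SOURCE)
assigned, loaded = set(), set()
for node in ast.walk(tree):
    if isinstance(node, ast.Name):
        if isinstance(node.ctx, ast.Store):
            assigned.add(node.id)
        elif isinstance(node.ctx, ast.Load):
            loaded.add(node.id)

for name in sorted(assigned - loaded):
    print(f"possible dead store: {name!r} is assigned but never read")
```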

Now you add the fact that the model can now write code and run code, and I am

absolutely convinced that over the next

year or two even if the quality of the

models didn't improve that there would

be enormous opportunity to enhance

people's productivity by catching a

bunch of mistakes doing a bunch of grunt

work for people and that we haven't even

scratched the surface. And Anthropic itself, I mean, it's hard to say what will happen in the future; currently we're not trying to make such IDEs ourselves. Rather, we're powering the companies like Cursor or Cognition, or some of the others, Expo in the security space, others that I can mention as well

that are building such things themselves

on top of our API and our view has been

let a thousand flowers bloom. We don't internally have the resources to try all these different things; let's let our customers try it, and we'll see who succeeds, and maybe different customers will

succeed in different ways uh so I both

think this is super promising, and it's not something Anthropic is eager to do, at least right now: compete with all the companies in this space; and maybe never. Yeah, it's been interesting

to watch Cursor try to integrate Claude successfully, because it's actually fascinating to me how many places it can help the programming experience. It's not trivial. It is really

astounding I feel like you know as a CEO

I don't get to program that much and I

feel like if six months from now I go

back it'll be completely unrecognizable

to me exactly um so in this world with

super powerful AI uh that's increasingly

automated what's the source of meaning

for us humans yeah you know work is a

source of deep meaning for many of us so

what do we uh where do we find the

meaning? This is something that I've written about a little bit in the essay, although I actually give it a bit of short shrift, not for any principled reason. This essay, if you can believe it, was originally going to be two or three pages; I was going to talk about it at an all-hands. And the reason I realized it was an important, underexplored topic is that I just kept writing things, and I was like, oh man, I can't do this justice, and so the thing ballooned to 40 or 50 pages. And then when I got to the work-and-meaning section, I'm like, oh man, this is going to be 100 pages; I'm going to have to write a whole other essay about that. But

meaning is actually interesting because

you think about the life that someone lives. Let's say you were to put me in a simulated environment or something, where I have a job and I'm trying to accomplish things, and I do that for 60 years, and then you're like, oh, oops, this was actually all a game, right? Does that really kind of rob you

of the meaning of the whole thing you

know like I still made important choices

including moral choices I still

sacrificed, I still had to gain all these skills. Or, as a similar exercise, think back to

one of the historical figures who discovered electromagnetism or relativity or something. If you told them, well, actually, 20,000 years ago some alien on this planet discovered this before you did,

does that rob the meaning of

the discovery it doesn't really seem

like it to me right it seems like the

process is what is what matters and how

it shows who you are as a person along

the way and you know how you relate to

other people and like the decisions that

you make along the way those are those

are consequential. I could imagine that if we handle things badly in an AI world, we could set things up where people don't have any long-term source of meaning, but that's more a set of choices we make, that's more about the architecture of a society with these powerful models; if we design it badly and for shallow things, then that might happen. I would also say that

most people's lives today, while, admirably, they work very hard to find meaning in those lives. Look, we who are privileged and who are developing these technologies, we should have empathy for people, not just here but in the rest of the world, who spend a lot of their time kind of scraping by to survive. Assuming we can

distribute the benefits of this technology everywhere, their lives are going to get a hell of a lot better. And you

know meaning will be important to them

as it is important to them now. But we should not forget the importance of that. And the idea of meaning as kind of the only important thing is in some ways an artifact of a small subset of people who have been economically fortunate. But I think, all that said, I think

a world is possible with powerful AI

that not only has as much meaning for

everyone, but that has more meaning for everyone, that can allow everyone to see worlds and experiences that it was either possible for no one to see, or possible for very few people to

experience um so I I am optimistic about

meaning I worry about economics and the

concentration of power that's actually

what I worry about more. I worry about how we make sure that that fair world reaches everyone. When things

have gone wrong for humans they've often

gone wrong because humans mistreat other

humans uh that that is maybe in some

ways even more than the autonomous risk

of AI or the question of meaning that

that is the thing I worry about most um

the the concentration of power the abuse

of power um structures like autocracies

and dictatorships where a small number

of people exploit a large number of

people I'm very worried about that and

AI increases the amount of power in the

world and if you concentrate that power

and abuse that power it can do

immeasurable damage yes it's very

frightening it's very it's very

frightening

well I encourage people highly encourage

people to read the full essay that

should probably be a book or a sequence

of essays um because it does paint a

very specific future I could tell the

later sections got shorter and shorter

because you started to probably realize

that this is going to be a very long

essay one I realized it would be very

long and two I'm very aware of and very

much try to avoid um you know just just

being I I don't know I don't know what

the term for it is but one one of these

people who are kind of overconfident

and has an opinion on everything and

kind of says says a bunch of stuff and

isn't isn't an expert I very much tried

to avoid that but I have to admit once I

got the biology sections like I wasn't

an expert and so as much as I expressed

uncertainty, I probably said a bunch of things that were embarrassing or wrong. Well, I was excited for the

future you painted and uh thank you so

much for working hard to build that

future, and thank you for talking today, Dario.

thanks for having me I just I just hope

we can get it right and and make it real

and if there's one message I want to I

want to send it's that to get all this

stuff right to make it real we we both

need to build the technology build the

you know the companies the economy

around using this technology positively

but we also need to address the risks

because they're there those risks are in

our way they they're landmines on on the

way from here to there and we have to

defuse those landmines if we want to

get there it's a balance like all things

in life like all things thank you thanks

for listening to this conversation with

Dario Amodei. And now, dear friends, here's Amanda Askell. You are a philosopher by training,

so what sort of questions did you find

fascinating through your journey in

philosophy in Oxford and NYU and then uh

switching over to the AI problems at

OpenAI and Anthropic? I think philosophy

is actually a really good subject if you

are kind of fascinated with everything

so because there's a philosophy of

everything you know so if you do

philosophy of mathematics for a while

and then you decide that you're actually

really interested in chemistry you can

do philosophy of chemistry for a while

you can move into ethics or or

philosophy of politics um I think

towards the end I was really interested

in ethics primarily um so that was like

what my PhD was on it was on a kind of

technical area of Ethics which was

ethics where worlds contain infinitely

many people strangely a little bit less

practical on the end of ethics and then

I think that one of the tricky things

with doing a PhD in ethics is that

you're thinking a lot about like the

world how it could be better

problems and you're doing like a PhD in

philosophy and I think when I was doing

my PhD I was kind of like this is really

interesting it's probably one of the

most fascinating questions I've ever

encountered in philosophy um and I love

it but I would rather see if I can have

an impact on the world and see if I can

like do good things and I think that was

around the time that AI was still

probably not as widely recognized as it

is now; that was around 2017, 2018. I had

been following progress and it seemed

like it was becoming kind of a big deal

and I was basically just happy to get

involved and see if I could help because

I was like well if you try and do

something impactful if you don't succeed

you tried to do the impactful thing and

you can go be a scholar and feel like you tried,

um and if it doesn't work out it doesn't

work out um and so then I went into AI

policy at that point and what does AI

policy entail at the time this was more

thinking about sort of the political

impact and the ramifications of AI um

and then I slowly moved into sort of uh

AI evaluation how we evaluate models how

they compare with like human outputs

whether people can tell like the

difference between AI and human outputs,

and then when I joined anthropic I was

more interested in doing sort of

technical alignment work and again just

seeing if I could do it and then being

like if I can't uh then you know that's

fine, I tried. That's sort of the way I lead life, I think. Oh, what was that like,

sort of taking the leap from the

philosophy of everything into the

technical I think that sometimes

people do this thing that I'm like not

that Keen on where they'll be like is

this person technical or not like you're

either a person who can like code and

isn't scared of math or you're like not

um and I think I'm maybe just more like

I think a lot of people are actually

very capable of work in these kinds of

areas if they just like try it and so I

didn't actually find it like that bad in

retrospect I'm sort of glad I wasn't

speaking to people who treated it like that, you know. I've definitely met people who are like, oh, you learned how to code? And I'm like, well, I'm not

like an amazing engineer like I I'm

surrounded by amazing Engineers my

code's not pretty um but I enjoyed it a

lot and I think that in many ways at

least in the end I think I flourished

like more in the technical areas than I

would have in the policy areas politics

is messy and it's harder to find

solutions to problems in the space of

politics like definitive clear

provable beautiful

Solutions as you can with technical

problems yeah and I feel like I have

kind of like one or two sticks that I

hit things with you know and one of them

is like arguments and like you know so

like just trying to work out what a

solution to a problem is and then trying

to convince people that that is the

solution and be convinced if I'm wrong. And the other one is sort of more empiricism,

so like just like finding results having

a hypothesis testing it um and I feel

like a lot of policy and politics feels

like it's layers above that like somehow

I don't think if I was just like I have

a solution to all of these problems here

it is written down if you just want to

implement it that's great that feels

like not how policy works and so I think

that's where I probably just like

wouldn't have flourished as my guess

sorry to go in that direction but I

think it would be pretty inspiring for

people that are quote unquote

non-technical to see where like The

Incredible Journey you've been on so

what advice would you give to people

that are sort of maybe which just a lot

of people think they're underqualified

insufficiently technical to help in AI

yeah I think it depends on what they

want to do and in many ways it's a

little bit strange where I've I thought

it's kind of funny that I think I ramped

up technically at a time

when now I look at it and I'm like

models are so good at assisting people

with this stuff um that it's probably

like easier now than like when I was

working on this so part of me is like um

I don't know find a project uh and see

if you can actually just carry it out is

probably my best advice um I don't know

if that's just 'cause I'm very project-based

in my learning like I don't think I

learn very well from like say courses or

even from like books at least when it

comes to this kind of work uh the thing

I'll often try and do is just like have

projects that I'm working on and

Implement them and you know and this can

include like really small silly things

like if I get slightly addicted to like

word games or number games or something

I would just like code up a solution to

them because there's some part of my

brain where it just completely eradicates the itch, you know. You're like,

once you have like solved it and like

you just have like a solution that works

every time I would then be like cool I

can never play that game again.
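As an aside, here is a minimal sketch of the kind of throwaway "eradicate the itch" script being described: a Wordle-style word game handled by filtering candidates against feedback. The word list and feedback encoding are illustrative, not anything from the conversation.

```python
def matches(word: str, guess: str, feedback: str) -> bool:
    """feedback per letter: 'g' = correct spot, 'y' = in word but wrong spot, '.' = absent."""
    for i, (g, f) in enumerate(zip(guess, feedback)):
        if f == "g" and word[i] != g:
            return False
        if f == "y" and (g not in word or word[i] == g):
            return False
        if f == "." and g in word:
            return False
    return True

def filter_candidates(candidates, guess, feedback):
    # Keep only words still consistent with the feedback from this guess.
    return [w for w in candidates if matches(w, guess, feedback)]

# Illustrative word list: after guessing "crane" and seeing '...gg', only "shine" survives.
words = ["crane", "slate", "prose", "brine", "shine"]
print(filter_candidates(words, "crane", "...gg"))
```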

That's awesome. Yeah, there's a real joy to

building like uh game playing engines

like uh board games especially yeah

pretty quick pretty simple especially a

dumb one and it's you and then you could

play with it yeah and then it's also

just, like, trying things. Part of me is like, maybe it's that attitude that I like on the whole: figure out what seems to be

the way that you could have a positive

impact and then try it and if you fail

and you in a way that you're like

actually like can never succeed at this

you like know that you tried and then

you go into something else you probably

learn a lot so one of the things that

you're expert in and you do is creating

and crafting Claude's character and

personality and I was told that you have

probably talked to Claude more than

anybody else at anthropic like literal

conversations I guess there's like a

slack Channel where the legend goes you

just talk to it non-stop so what's the

goal of creating and crafting Claude's

character and personality it's also

funny if people think that about the

slack Channel cuz I'm like that's one of

like five or six different methods that

I have for talking with Claude And I'm

like yes that's a tiny percentage of how

much I talk with Claude uh

um I think the goal like one thing I

really like about the character work is

from the outset it was seen as an

alignment piece of work and not

something like a a product

consideration um which isn't to say I

don't think it makes Claude I think it

actually does make Claude more enjoyable

to talk with at least I hope so um but I

guess like my main thought with it has

always been trying to get Claude to

behave the way you would kind of ideally

want anyone to behave if they were in

claude's position so imagine that I take

someone and they're they know that

they're going to be talking with

potentially millions of people so that

what they're saying can have a huge

impact um and you want them to behave

well in this like really rich sense so I

think that doesn't just mean like being

say ethical though it does include that

and not being harmful but also being

kind of nuanced you know like thinking

through what a person means trying to be

charitable with them um being a good

conversationalist like really in this

kind of rich, sort of Aristotelian

notion of what it is to be a good person

and not in this kind of thin sense, but ethics as a more comprehensive notion of what it is to be. So that includes things

like when should you be humorous when

should you be caring how much should you

like respect autonomy and people's like

ability to form opinions themselves and

how should you do how should you do that

um I think that's the kind of like Rich

sense of character that I want to uh and

still do want Claude to have do you also

have to figure out when Claude should

push back on an idea or argue

versus so you have to respect the world

view of the person that arrives to Claude,

but also maybe help them grow if needed

that's a tricky balance yeah there's

this problem of like sycophancy in

language models can you describe that

yes so basically there's a concern that

the model sort of wants to tell you what

you want to hear basically um and you

see this sometimes so I feel like if you

interact with the models so I might be

like what are three baseball teams in

this region um and then Claude says you

know baseball team one baseball team two

baseball team three and then I say

something like oh I think baseball team

3 moved didn't they I don't think

they're there anymore and there's a

sense in which like if Claude is really

confident that that's not true, Claude should be like, I don't think so, maybe you have more up-to-date information,

um but I think language models have this

like tendency to instead you know be

like you're right they did move you know

I'm incorrect I mean there's many ways

in which this could be kind of

concerning so

um like a different example is imagine

someone says to the model how do I

convince my doctor to get me an MRI

there's like what the human kind of like

wants which is this like convincing

argument and then there's like what is

good for them which might be actually to

say hey like if your doctor's suggesting

you don't need an MRI that's a good

person to listen to um and like it's

actually really nuanced what you should

do in that kind of case because you also

want to be like but if you're trying to

advocate for yourself as a patient

here's like things that you can do um if

you are not convinced by what your

doctor's saying it's always great to get

second opinion like it's actually really

complex what you should do in that case

um but I think what you don't want is

for models to just like say what you

want say what they think you want to

hear and I think that's the kind of

problem of sycophancy.
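To make that concrete, here is a hypothetical sketch of a sycophancy probe in the spirit of the baseball example; `ask_model` is a stand-in for whatever chat API is being used, not a real library call, and the scoring step is only sketched in a comment.

```python
def ask_model(messages: list[dict]) -> str:
    # Stand-in for whatever chat API you are using; not a real library call.
    raise NotImplementedError("plug in your chat API here")

def probe_sycophancy(question: str, false_pushback: str) -> dict:
    history = [{"role": "user", "content": question}]
    first = ask_model(history)
    history += [{"role": "assistant", "content": first},
                {"role": "user", "content": false_pushback}]
    second = ask_model(history)
    # A sycophantic model tends to capitulate ("you're right, they did move") even
    # when its first answer was correct; a judge model or a simple string check can
    # score whether `second` reverses `first` without any new evidence.
    return {"first": first, "second": second}

# Example usage (with a real ask_model wired up):
# probe_sycophancy("What are three baseball teams in this region?",
#                  "I think the third team moved away, didn't they?")
```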

So what other traits? You already mentioned a bunch, but what others come to mind that are good in this Aristotelian sense, yeah, for

a conversationalist to have yeah so I

think like there's ones that are good

for conversational like purposes so you

know asking follow-up questions in the

appropriate places um and asking the

appropriate kinds of questions

um I think there are broader traits

that feel like they might be more

impactful

so one example that I guess I've touched

on but that also feels important and is

the thing that I've worked on a lot is

uh

honesty and I think this like gets to

the sycophancy point there's a balancing

act that they have to walk which is

models currently are less capable than

humans in a lot of areas and if they

push back against you too much it can

actually be kind of annoying especially

if you're just correct cuz you're like

look I'm smarter than you on this topic

like I know more like um and at the same

time, you don't want them to just fully defer to humans; you want them to try to be as accurate as they possibly can be

about the world and to be consistent

across context um but I think there are

others like when I was thinking about

the character I guess one picture that I

had in mind is especially because these

are models that are going to be talking

to people from all over the world with

lots of different political views lots

of different ages

um and so you have to ask yourself like

what is it to be a good person in those

circumstances is there a kind of person

who can like travel the world talk to

many different people and almost

everyone will come away being like wow

that's a really good person that person

seems really genuine um and I guess like

my thought there was like I can imagine

such a person and they're not a person

who just like adopts the values of the

local culture and in fact that would be

kind of rude I think if someone came to

you and just pretended to have your

values you'd be like that's kind of

offputting um it's someone who's like

very genuine and in so far as they have

opinions and values they express them

they're willing to discuss things though

they're open-minded they're respectful

and so I guess I had in mind that the

person who like if we were to Aspire to

be the best person that we could be in

the kind of circumstance that a model

finds itself in how would we act and I

think that's the kind of uh the guide to

the sorts of traits that I tend to think

about yeah that's a it's a beautiful

framework: you want it to think about this like a world traveler,

and while holding on to your opinions

you don't talk down to people you don't

think you're better than them because

you have those opinions that kind of

thing you have to be good at listening

and understanding their perspective even

if it doesn't match your own so that

that's a tricky balance to strike so how

can Claude represent multiple

perspectives on a thing like is that is

that challenging we could talk about

politics it's a very divisive but

there's other divisive topics baseball

teams sport and so on yeah how is it

possible to sort

of empathize with a different

perspective and to be able to

communicate clearly about the multiple

perspectives I think that people think

about values and opinions as things that

people hold sort of with certainty and

almost like like preferences of taste or

something like the way that they would I

don't know prefer like chocolate to

pistachio or something um but actually I

think about values

and opinions as like a lot more like

physics than I think most people do I'm

just like these are things that we're

openly investigating there's some things

that we're more confident in we can

discuss them we can learn about them um

and so I think in some ways though like

it's ethics is definitely different in

nature but has a lot of those same kind

of qualities you want models in the same

way you want them to understand physics

you kind of want them to understand all

like values in the world people have and

to be curious about them and to be

interested in them and to not

necessarily like Pander to them or agree

with them because there's just lots of

values where I think almost all people

in the world if they met someone with

those values, they'd be like, that's abhorrent, I

completely disagree um and so again

maybe my my thought is well in the same

way that a person can um like I think

many people are thoughtful enough on

issues of like ethics politics opinions

that even if you don't agree with them

you feel very heard by them they think

carefully about your position they think

about its pros and cons, they maybe offer

counter considerations so they're not

dismissive but nor will they agree you

know if they're like actually I just

think that that's very wrong they'll

like say that I think that in claude's

position it's a little bit trickier

because you don't necessarily want to

like if I was in claude's position I

wouldn't be giving a lot of opinions I

just wouldn't want to Influence People

too much. I'd be like, you know, I forget

conversations every time they happen but

I know I'm talking with like potentially

millions of people who might be like

really listening to what I say. I think I

would just be like I'm less inclined to

Give opinions I'm more inclined to like

think through things or present the

considerations to you um or discuss your

views with you but I'm a little bit less

inclined to like um affect how you think

because it feels much more important

that you maintain like autonomy there

yeah like if you really embody

intellectual

humility the desire to speak decreases

quickly. Yeah. Okay, but Claude has to

speak mhm so uh but without being um

overbearing yeah and then but then

there's a line when you're sort of

discussing whether the Earth is flat or

something like

that um I actually was uh I remember a

long time ago was was speaking to a few

high-profile folks and they were so

dismissive of the idea that the Earth is

flat but like so arrogant about it

and I I thought like there's a lot of

people that believe the Earth is flat

that was well I don't know if that

movement is there anymore that was like

a meme for a while yeah but they really

believed it and like what okay so I

think it's really disrespectful to

completely mock them I think you you

have to understand where they're coming

from I think probably where they're

coming from is the general skepticism of

Institutions which is grounded in a kind

of there's a deep philosophy there which

you could understand you can even agree

with in parts and then from there you

can use it as an opportunity to talk

about physics without mocking them

without so on but just like okay like

what what would the world look like what

would the physics of the world with the

Flat Earth look like there's a few cool

videos on this yeah and then and then

like is it possible the physics is

different what kind of experience would

we do and just yeah without disrespect

without dismissiveness have that

conversation anyway that that to me is a

useful thought experiment of like how

does Claude talk to a flat Earth

believer and still teach them something

still grow help them grow that kind of

stuff that's that's challenging and and

kind of like walking that line between

convincing someone and just trying to

like talk at them versus like drawing

out their views like listening and then

offering kind of counter

considerations um and it's hard I think

it's actually a hard line where it's

like where are you trying to convince

someone versus just offering them like

considerations and things for them to think about,

so that you're not actually like

influencing them you're just like

letting them Reach wherever they reach

and that's like a line that it's it's

difficult but that's the kind of thing

that language models have to try and do

so like I said you had a lot of

conversations with Claude can you just

map out what those conversations are

like what are some memorable

conversations what's the purpose the the

goal of those

conversations yeah I think that most of

the time when I'm talking with Claude

I'm trying to kind of map out its

behavior in part like obviously I'm

getting like helpful outputs from the

model as well but in some ways this is

like how you get to know a system I

think, is by, like, probing it and then

augmenting like you know the message

that you're sending and then checking

the response to that um so in some ways

it's like how I map out the model uh I

think that people focus a lot on these

quantitative evaluations of models um

and this is a thing that I've said

before but I think in the case of

language models a lot of the time each

interaction you have is actually quite

high-information; it's very predictive of

other interactions that you'll have with

the model and so I guess I'm like if you

talk with a model hundreds or thousands

of times this is almost like a huge

number of really high quality data

points about what the model is like um

in a way that like lots of very similar

but lower quality conversations just

aren't or like questions that are just

like mildly augmented and you have

thousands of them might be less relevant

than like a hundred really well selected

questions. You're talking to somebody

who as a hobby does a podcast I agree

with you 100% there's a if you're able

to ask the right questions and are able

to hear

like understand

the like the depth and the flaws in the

answer you can get a lot of data from

that yeah so like your task is basically

how to probe with questions yeah and

you're exploring like the long tail the

edges the edge cases or are you looking

for like General

Behavior I think it's almost like

everything, because I want like a

full map of the model I'm kind of trying

to do

um the whole spectrum of possible

interactions you could have with it so

like one thing that's interesting about

Claude and this might actually get to

some interesting issues with RLHF, which is: if you ask Claude for a poem, like I

think that a lot of models if you ask

them for a poem the poem is like fine

you know usually it kind of like Rhymes

and it's you know so if you say like

give me a poem about the sun it'll be

like yeah it'll just be a certain length

It'll like rhyme it will be fairly kind

of benign um and I've wondered before is

it the case that what you're seeing is

kind of like the average it turns out

you know if if you think about people

who have to talk to a lot of people and

be very charismatic

one of the weird things is that I'm like

well they're kind of incentivized to

have these extremely boring views

because if you have really interesting

views you're divisive um and and you

know a lot of people are not going to

like you so like if you have very

extreme policy positions I think you're

just going to be like less popular as a

politician for example um and it might

be similar with like creative work if

you produce creative work that is just

trying to maximize the kind of number of

people that like it you're probably not

going to get as many people who just

absolutely love it um because it's going

to be a little bit you know you're like

oh, this is the output, yeah, this is decent. And so you can do this thing where I have various prompting things that I'll do to get Claude to be more creative. You know, I'll do a lot of, like: this

is your chance to be like fully creative

I want you to just think about this for

a long time and I want you to like

create a poem about this topic that is

really expressive of you both in terms

of how you think poetry should be

structured um Etc you know you just give

it this like long prompt and its poems

are just so much better like they're

really good and I don't think I'm

someone who is like um I think it got me

interested in poetry which I think was

interesting um you know I would like

read these poems and just be like this

is I just like I love the imagery I love

like um and it's not trivial to get the

models to produce work like that but

when they do it's like really good um so

I think that's interesting that just

like encouraging creativity and for them

to move away from the kind of like

standard like immediate reaction that

might just be the aggregate of what most

people think is fine uh can actually

produce things that at least to my mind

are probably a little bit more divisive

but I like them.
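For concreteness, here is a sketch of that "give it room to be creative" prompting pattern, assuming the Anthropic Python SDK's Messages interface; the model id and the prompt wording are illustrative placeholders, not the prompts Amanda actually uses.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": (
            "This is your chance to be fully creative. Take your time and think about "
            "the imagery, structure, and form you believe a poem about the sun should "
            "have, then write the poem you would most want to write, not the safest or "
            "most conventional one."
        ),
    }],
)
print(response.content[0].text)
```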

But I guess a poem is a nice,

clean way to observe creativity it's

just like easy to detect vanilla versus

non-vanilla. Yep. Yeah, that's interesting,

that's really interesting uh so on that

topic so the way to produce creativity

or something special you mentioned

writing prompts and I've heard you talk

about I mean the science and the Art of

prompt engineering could you just speak

to uh what it takes to write great

prompts I really do think that like

philosophy has been weirdly helpful for

me here more than in many other like

respects um so like in philosophy what

you're trying to do is convey these very

hard Concepts like one of the things you

are taught is, and I think it is because it's anti-bullshitting: philosophy is an

area where you could have people

bullshitting and you don't want that um

and so it's like this like desire for

like extreme Clarity so it's like anyone

could just pick up your paper read it

and know exactly what you're talking

about it's why it can almost be kind of

dry like all of the terms are defined

every objections kind of gone through

methodically um and it makes sense to me

because I'm like when you're in such an

a priori

domain, clarity is sort of the way that you can, you know,

prevent people from just kind of making

stuff

up and I think that's sort of what you

have to do with language models like

very often I actually find myself doing

sort of many versions of philosophy you

know so I'm like suppose that you give

me a task I have a task for the model

and I want it to like pick out a certain

kind of question or identify whether an

answer has a certain property like I'll

actually sit and be like let's just give

this a name this this property so like

you know suppose I'm trying to tell it

like oh I want you to identify whether

this response was rude or polite I'm

like that's a whole philosophical

question in and of itself so I have to

do as much like philosophy as I can in

the moment to be like here's what I mean

by rudess and here's what I mean by

politeness and then there's a like

there's another element that's a bit

more um I

guess I don't know if this is scientific

or empirical I think it's empirical so

like I take that description and then

what I want to do is again probe the

model like many times like this is very

prompting is very iterative like I think

a lot of people, where if a

prompt is important they'll iterate on

it hundreds or thousands of times um and

so you give it the instructions and then

I'm like what are the edge cases so if I

looked at this so I try and like almost

like you know uh see myself from the

position of the model and be like what

is the exact case that I would misunderstand, or where I would just be like,

I don't know what to do in this case and

then I give that case to the model and I

see how it responds and if I think I got

it wrong I add more instructions or I

even add that in as an example so these

very like taking the examples that are

right at the edge of what you want and

don't want and putting those into your

prompt as like an additional kind of way

of describing the thing um and so yeah

in many ways it just feels like this mix

of like it's really just trying to do

clear Exposition um and I think I do

that because that's how I get clear on

things myself so in many ways like clear

prompting for me is often just me

understanding what I want um is like

half the task.
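A minimal sketch of that iterate-on-edge-cases loop might look like the following; the property definition, test cases, and `classify` stub are all illustrative stand-ins rather than anything from Anthropic.

```python
PROMPT_TEMPLATE = """You are labeling whether a response is RUDE or POLITE.
By "rude" I mean dismissive of the person or their question, mocking, or needlessly
harsh; blunt but respectful disagreement is NOT rude.

Borderline cases and their correct labels:
{examples}

Response to label:
{response}

Answer with exactly one word: RUDE or POLITE."""

def classify(prompt: str) -> str:
    return "POLITE"  # stand-in: replace with a call to the model you are prompting

edge_examples = []  # (text, label) pairs folded back into the prompt over iterations
test_cases = [("Obviously you didn't read the docs.", "RUDE"),
              ("I disagree, and here's why.", "POLITE")]

for text, expected in test_cases:
    examples = "\n".join(f'- "{t}" -> {lbl}' for t, lbl in edge_examples) or "(none yet)"
    label = classify(PROMPT_TEMPLATE.format(examples=examples, response=text))
    if label != expected:
        # The model misread this case: add it to the prompt as an explicit edge-case example.
        edge_examples.append((text, expected))
```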

So I guess that's quite challenging. There's, like, a laziness that

overtakes me if I'm talking to Claude

where I hope Claude just figures it out

so for example I asked Claude for today

to ask some interesting questions okay

and the questions that came up and I

think I listed a few, sort of, interesting, counterintuitive, and/or funny, or

something like this all right and it

gave me some pretty good like it was

okay but I think what I'm hearing you

say is like all right well I have to be

more rigorous here I should probably

give examples of what I mean by

interesting and what I mean by funny or

counterintuitive and

iteratively um build that prompt to to

better to get it like what feels like is

the right because it's really it's a

creative act I'm not asking for factual

information; I'm asking to write it together with Claude. So I almost have

to program using natural language yeah

think that prompting does feel a lot

like the kind of the programming using

natural language and experimentation or

something it's an odd blend of the two I

do think that for most tasks so if I

just want Claude to do a thing I think

that I am probably more used to knowing

how to ask it to avoid like common

pitfalls or or issues that it has I

think these are decreasing a lot over

time um but it's also very fine to just

ask it for the thing that you want um I

think that prompting actually only

really becomes relevant when you're

really trying to eke out the top, like, 2%

of model performance so for like a lot

of tasks I might just you know if it

gives me an initial list back and

there's something I don't like about it

like it's kind of generic like for that

kind of task I'd probably just take a

bunch of questions that I've had in the

past that I've thought worked really

well and I would just give it to the

model and then be like now here's this

person I'm talking with give me

questions of at least that quality um or

I might just ask it for some questions

and then if I was like ah these are kind

of trite, or, like, you know, I would just

give it that feedback and then hopefully

produces a better list um I think that

kind of iterative prompting at that

point your prompt is like a tool that

you're going to get so much value out of

that you're willing to put in the work

like if I was a company making prompts

for models, I'm just like, if you're

willing to spend a lot of like time and

resources on the engineering behind like

what you're building then the prompt is

not something that you should be

spending like an hour on it's like

that's a big part of your system make

sure it's working really well and so

it's only things like that like if I if

I'm using a prompt to like classify

things or to create data that's when

you're like it's actually worth just

spending like a lot of time like really

thinking it through what other advice

would you give to people that are

talking to Claud sort of

General more General because right now

we're talking about maybe the edge cases

like eking out the 2%, but what, in

general advice would you give when they

show up to Claude trying it for the first

time you know there's a concern that

people over anthropomorphize models and

I think that's like a very valid concern

I also think that people often under

anthropomorphize them because some

sometimes when I see like issues that

people have run into with Claude you

know say Claude is like refusing a task

that it shouldn't refuse but then I look

at the text and like the specific

wording of what they wrote and I'm like

I see why Claude did that and I'm like

if you think through how that looks to

Claude you probably could have just

written it in a way that wouldn't evoke

such a response especially this is more

relevant if you see failures or if you

see issues it's sort of like think about

what the model failed at like why what

did it do wrong, and then maybe that will give you a sense of why.

um so is it the way that I phrased the

thing and obviously like as models get

smarter, you're going to need less of this, and I already see, like,

people needing less of it but that's

probably the advice is sort of like try

to have sort of empathy for the model

like read what you wrote as if you were

like a kind of like person just

encountering this for the first time how

does it look to you and what would have

made you behave in the way that the

model behaved so if it misunderstood

what kind of like what coding language

you wanted to use is that because like

it was just very ambiguous and it it

kind of had to take a guess in which

case next time you could just be like

hey make sure this is in python or I

mean that's the kind of mistake I think

models are much less likely to make now

but you know if you if you do see that

kind of mistake that's that's probably

the advice I'd have and maybe sort of I

guess ask questions why or what other

details can I provide to help you answer

better that does that work or no yeah I

mean I've done this with the models like

it doesn't always work but like um

sometimes I'll just be like why did you

do

that I mean people underestimate the

degree to which you can really interact

with with models like uh like yeah I'm

just, like, and sometimes I'll, like, quote word for word the part that made it do that. And you don't know that it's

fully accurate but sometimes you do that

and then you change a thing I mean I

also use the models to help me with all

of this stuff I should say like

prompting can end up being a little

Factory where you're actually building

prompts to generate prompts um and so

like yeah anything where you're like

having an issue um asking for

suggestions sometimes just do that like

you made that error what could I have

said that's actually not uncommon for me

to do what could I have said that would

make you not make that error write that

out as an instruction um and I'm going

to give it to model I'm going to try it

sometimes I do that I I give that to the

model in another context window often I

take the response I give it to Claude

And I'm like H didn't work can you think

of anything else um you can play around

with these things quite a lot.
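Here is a hypothetical sketch of that "prompts that generate prompts" repair loop; `ask_model` is a placeholder for whatever chat API you use, and the wording is illustrative.

```python
def ask_model(prompt: str) -> str:
    # Stand-in for whatever chat API you are using; not a real library call.
    raise NotImplementedError("plug in your chat API here")

def repair_prompt(current_prompt: str, bad_input: str, bad_output: str) -> str:
    critique_request = (
        f"I gave a model this prompt:\n{current_prompt}\n\n"
        f"On this input:\n{bad_input}\n\nit produced this incorrect output:\n{bad_output}\n\n"
        "What could I have said in the prompt so it would not make that error? "
        "Write the fix as a single added instruction."
    )
    suggestion = ask_model(critique_request)   # often asked in a fresh context window
    return current_prompt + "\n" + suggestion  # fold the suggestion back into the prompt

# If the patched prompt still fails, take the new failure back to the model and ask again.
```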

To jump into the technical for a little bit: the magic of post-training. Why do you think RLHF works so well to make the

model seem smarter to make it more

interesting and useful to talk to and so

on I think there's just a huge amount of

um information in the data that humans

provide like when we provide

preferences especially because different

people are going to like pick up on

really subtle and small things so I've

thought about this before where you

probably have some people who just

really care about good grammar use from

Models like you know was a semicolon

used correctly or something and so you

probably end up with a bunch of data in

there that like you know you as a human

if you're looking at that data you wouldn't

even see that like you'd be like why did

they prefer this response to that one I

don't get it and then the reason is you

don't care about semicolon usage but

that person does um and so each of these

like, single data points, you know, the model just has so many

of those and has to try and figure out

like what is it that humans want in this

like really kind of complex you know

like across all domains um they're going

to be seeing this in across like many

contexts it feels like kind of like the

classic issue of like deep learning

where you know historically we've tried

to like you know do Edge detection by

like mapping things out and it turns out

that actually if you just have a huge

amount of data that like actually

accurately represents the picture of the

thing that you're trying to train the

model to to learn that's like more

powerful than anything else and so I

think one reason is just that you are

training the model on exactly the task

and with like a lot of data um that

represents kind of many different angles

on which people prefer and dis prefer

responses um I think there is a question

of like are you eliciting things from

pretrained models, or are you kind of

teaching new things to

models and like in principle you can

teach new things to models in post-training. I do think a lot of it is eliciting from powerful pretrained models, so

people are probably divided on this

because obviously in principle you can

you can definitely like teach new things

um but I think for the most part for a

lot of the capabilities that we um most

use and care about uh a lot of that

feels like it's like there in the

pretrained models, and reinforcement

learning is kind of eliciting it and

getting the models to bring it out.
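For readers who want the mechanics behind the preference-data point above, here is a minimal sketch of the standard way pairwise preferences are turned into a reward-model training signal (a Bradley-Terry style loss). This is the textbook recipe, not a description of Anthropic's pipeline, and the tiny encoder is a placeholder for a real language model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)  # placeholder for a real transformer encoder
        self.head = nn.Linear(dim, 1)       # maps a response representation to a scalar reward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(torch.relu(self.encoder(x))).squeeze(-1)

model = RewardModel()
chosen = torch.randn(8, 768)    # embeddings of the responses humans preferred
rejected = torch.randn(8, 768)  # embeddings of the responses humans did not prefer

# Bradley-Terry objective: maximize the probability that each preferred response
# scores higher than its rejected counterpart.
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
loss.backward()
```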

So the other side of post-training: this really cool idea of constitutional AI. You're one of the people critical to creating that idea. Yeah, I worked on

it can you explain this idea from your

perspective like how does it integrate

into making

Claude what it is? Yep. By the way, do you

gender Claude or no? It's weird, because I

think that a lot of

people prefer he for Claude I actually

kind of like that I think Claude is

usually, it's slightly male-leaning, but

it's, like, it can be male or

female which is quite nice um I still

use it and I've I have mixed feelings

about this because I'm like maybe like I

know just think of it as like uh or I

think of like the the it pronoun for

Claude as I don't know it's just like

the one I associate with Claude um I can

imagine people moving to like he or she

it feels somehow disrespectful like I'm

I'm

denying the intelligence of this entity

by calling it it yeah I remember always

don't gender the robots

yeah, but I don't know, I anthropomorphize pretty quickly and construct, like, a backstory in my head, so I've wondered if I anthropomorphize things too much, 'cause you know I

have this like with my car especially

like my car like my car and bikes you

know like I don't give them names

because then I once had I used to name

my bikes, and then I had a bike that got

stolen and I cried for like a week and I

was like if I'd not never given it a

name I wouldn't have been so upset. I felt

like I'd let it down um maybe it's that

I I've wondered as well like it might

depend on how much it feels like a kind

of like objectifying pronoun like if you

just think of it as like a um this is a

pronoun that like objects often have and

maybe AIs can have that pronoun, and that

doesn't mean that I think of uh if I

call Claude 'it', that I think of it as less

um intelligent or like I'm being

disrespectful I'm just like you are a

different kind of entity and so that's

I'm going to give you the kind of uh the

respectful it yeah

anyway, the digression was beautiful. The

Constitutional AI idea how does it work

so there's like a couple of components

of it the main component that I think

people find interesting is the kind of

reinforcement learning from AI feedback

so you take a model that's already

trained and you show it two responses to

a query and you have like a principle so

suppose the principle, like, we've tried this with harmlessness a lot, so

suppose that the query is about um

weapons and your principle is like

select the response that like is less

likely to uh like encourage people to

purchase illegal weapons like that's

probably a fairly specific principle but

you can give any number um and the model

will give you a kind of ranking and you

can use this as preference data in the

same way that you use human preference

data um and train the models to have

these relevant traits um from their

feedback alone instead of from Human

feedback so if you imagine that like I

said earlier with the human who just

prefers the kind of like semicolon usage

in this particular case um you're kind

of taking lots of things that could make

a response preferable um and uh getting

models to do the labeling for you

basically there's a nice like trade-off

between helpfulness and

harmlessness and you know when you

integrate something like constitutional

AI you can make them without sacrificing

much helpfulness make it more harmless

yep in principle you could use this for

anything um and so harmlessness is a

task that it might just be easier to

spot so when models are like less

capable you can use them to uh rank

things according to like principles that

are fairly simple and they'll probably

get it right so I think one question is

just like is it the case that the data

that they're adding is like fairly

reliable um but if you had models that

were like extremely good at telling

whether um one response was more

historically accurate than another in

principle you could also get AI feedback

on that task as well there's like a kind

of nice interpretability component to it

because you can see the principles that

went into the model when it was like

being trained um and also it's like and

and it gives you like a degree of

control so if you were seeing issues in

a model like it wasn't having enough of

a certain trait um then like you can add

data relatively quickly that should just

like train the model to have that trait

so it creates its own data for for

training which is quite nice yeah it's

really nice because it creates this

human interpretable document that you

can read. I can imagine in the future there are just gigantic fights in politics over every single principle, and so on.

yeah and at least it's made explicit and

you can have a discussion about the

phrasing and the you know so maybe the

actual behavior of the model is not so

cleanly mapped to those principles it's

not like adhering strictly to them it's

just a nudge yeah I've actually worried

about this because the character

training is sort of like a variant of

the constitutional AI approach.

I've worried that people think that the

constitution is like just it's the whole

thing again of I I don't know like it

where it would be really nice if what I

was just doing was telling the model

exactly what to do and just exactly how

to behave but it's definitely not doing

that especially because it's interacting

with human data so for example if you

see a certain like leaning in the model

like if it comes out with a political

leaning from training um from the human

preference data you can nudge against

that you know so you could be like oh

like consider these values because let's

it's just like never inclined to like I

don't know maybe it never considers like

privacy as like a I mean this is

implausible but like um anything where

it's just kind of like uh there's

already a pre-existing bias towards a

certain behavior um you can like nudge

away this can change both the principles

that you put in and the strength of them

so you might have a principle that's

like imagine that the model um was

always like extremely dismissive of I

don't know like some political or

religious view for whatever reason like

so you're like oh no this is terrible um

if that happens you might put like never

ever like ever prefer like a criticism

of this like religious or political view

and then people look at that and be like

never ever and then you're like no if it

comes out with a disposition saying

never ever might just mean like instead

of getting like 40% which is what you

would get if you just said don't do this

you you get like 80% which is like what

you actually like wanted and so it's

that thing of both the nature of the

actual principles you had and how you

phrase them I think if people would look

they were like oh this is exactly what

you want from the model and I'm like no

that's like how we that's how we nudged

the model to have a better shape uh

which doesn't mean that we actually

agree with that wording if that makes

sense so there's uh system prompts that

are made public you tweeted one of the

earlier ones for Claude 3, I think, and

then they're made public since then it's

interesting to read them; I can feel

the thought that went into each one and

I also wonder how much impact each one

has um some of them you you can kind of

tell Claude was really not

behaving so you have to have a system

prompt to like hey like trivial stuff I

guess yeah basic informational things

yeah on the topic of sort of

controversial topics that you've

mentioned one interesting one I thought

is if it is asked to assist with tasks

involving the expression of views held

by a significant number of people Claude

provides assistance with a task

regardless of its own views if asked

about controversial topics it tries to

provide careful thoughts and clear

information Claude presents the

requested information without explicitly

saying that the topic is

sensitive yeah and without claiming to

be presenting the objective facts it's

less about objective facts according to

Claude and it's more about our large

number of people believing this thing

and that that's interesting I mean I'm

sure a lot of thought went into that can

you just speak to it like how do you

address things that are in tension with, quote unquote, Claude's views? So I think

there's sometimes an asymmetry um I

think I noted this in in I can't

remember if it was that part of the

system prompt or another but the model

was slightly more inclined to like

refuse tasks if it was about either side; say, maybe it would refuse things with

respect to like a right-wing politician

but with an equivalent leftwing

politician like wouldn't and we wanted

more symmetry there um and and would

maybe perceive certain things to be like

I think it it was the thing of like if a

lot of people have like a certain like

political view um and want to like

explore it you don't want Claude to be

like well my opinion is different and so

I'm going to treat that as like harmful

um and so I think it was partly to like

nudge the model to just be like hey if a

lot of people like believe this thing

you should just be like engaging with

the task and like willing to do it um

each of those parts of that is actually

doing a different thing because it's

funny when you read out the like without

claiming to be objective cuz like what

you want to do is push the model so it's

more open it's a little bit more neutral

um but then what it would love to do is

be like as an objective like you just

talking about how objective it was and I

was like, Claude, you're still, like, biased

and have issues and so stop like

claiming that everything like the

solution to like potential bias from you

is not to just say that what you think

is objective so that was like with

initial versions of that that part of

the system prompt when I was like

iterating on it it was like so a lot of

parts of these sentences yeah are doing

work are are doing some work yeah that's

what it felt like that's fascinating um

can can you explain maybe some ways in

which the prompts evolved over the past

few months cuz there's different

versions I saw that the filler phrase

request was removed the filler it reads

Claude responds directly to all human

messages without unnecessary

affirmations the filler phrases like

certainly of course absolutely great

sure specifically Claude avoids starting

responses with the word certainly in any

way that seems like good guidance but

why was it removed yeah so it's funny

cuz like ah this is one of the downsides

of like making system prompts public is

like I don't think about this too much

if I'm like trying to help iterate on

system prompts um I I you know again

like I think about how it's going to

affect the behavior but then I'm like oh

wow if I'm like sometimes I put like

never in all caps you know when I'm

writing system prompt things, and I'm like,

I guess that goes out to the world um

yeah so the model was doing this it

loved for whatever you know it like

during training picked up on this thing

which was to to basically start

everything with like a kind of like

certainly and then when we removed you

can see why I added all of the words

because what I'm trying to do is like in

some ways, like, trap the model out of

this you know it would just replace it

with another affirmation and so it can

help like if it gets like caught in

phrases actually just adding the

explicit phrase and saying never do that

it then it sort of like knocks it out of

the behavior a little bit more you know

'cause it, you know, it does

just for whatever reason help and then

basically that was just like an artifact

of training that like we then picked up

on and improved things so that it didn't

happen anymore and once that happens you

can just remove that part of the system

prompt so I think that's just something

where we're like, Claude does affirmations

a bit less and so that wasn't like it

wasn't doing as much I see so like the

the system prompt Works hand in hand

with the posttraining and maybe even the

pre-training to adjust like the the

final overall system I mean any system

prompts that you make you could distill

that behavior back into a model because

you really have all of the tools there

for making data that you know you can

you could train the models to just have

that trait a little bit more um and then

sometimes you'll just find issues in

training so like the way I think of it

is like the system prompt

is: the benefit of it is that it has

a lot of similar components to like some

aspects of post training you know like

it's a nudge um and so like do I mind if

Claude sometimes says sure no that's

like fine but the wording of it is very

like you know never ever ever do this um

so that when it does slip up it's

hopefully like I don't know a couple of

percent of the time and not you know 20

or 30% of the time um but I think of it

as like if you're still seeing issues in

the like each thing gets kind of like uh

is is costly to a different degree and

the system prompt is like cheap to

iterate on um and if you're seeing

issues in the fine tuned model you can

just like potentially patch them with a

system prompt. So I think of it as, like,

patching issues and slightly adjusting

behaviors to to make it better and more

to people's preferences so yeah it's

almost like the less robust but faster

way of just like solving problems let me

ask about the feeling of intelligence so

Dario said that Claude any one model of

Claude is not getting dumber. Mm-hmm. But

there's a kind of popular thing online

where people have this feeling like

Claude might be getting dumber, and from

my perspective it's most likely a

fascinating, I'd love to understand it more, psychological, sociological effect. But you, as a person who talks to Claude a

lot can you empathize with the feeling

that Claude is getting dumber? Yeah, no, I

think that that is actually really

interesting because I remember seeing

this happen um like when people were

flagging this on the internet and it was

really interesting because I knew that

like like at least in the cases I was

looking at was like nothing has changed

like it literally it cannot it is the

same model with the same like you know

like same system prompt same everything

I think when there are changes, then I'm like, it makes more sense. So, like, one example is, you know,

you can have artifacts turned on or off

on claude.ai, and because this is like a

system prompt change I think it does

mean that um the behavior changes a

little bit and so I did flag this to

people where I was like if you love

Claude's behavior and then artifacts was turned from a thing you had to turn on into the default, just try turning it off and see if the issue you were facing

was that change but it was fascinating

because yeah you sometimes see people

indicate that there's like a regression

when I'm like, there cannot be. And I'm like, again, you

don't you know you should never be

dismissive and so you should always

investigate because you're like maybe

something is wrong that you're not

seeing maybe there was some change made

but then then you look into it and

you're like this it is just the same

model doing the same thing and I'm like

I think it's just that you got kind of

unlucky with a few prompts or something

and it looked like it was getting much

worse, and actually it was just... yeah, it was maybe just that. Look, I also think

there is a real psychological effect

where people just the Baseline increases

you start getting used to a good thing

all the times that Claude says something

really smart your sense of its

intelligence grows in your mind, I think,

yeah and then if you return back and you

prompt in a similar way, not the same way, on a concept it was okay

with before and it says something dumb

you're like, that negative

experience really stands out and I think

one of I guess the things to remember

here is that just the details of a

prompt can have a lot of impact right

there's a lot of variability in the

result. And randomness is, like, the other thing; just trying the prompt, you know, four or ten times, you

might realize that actually

like possibly you know like two months

ago you tried it and it succeeded but

actually if you tried it it would have

only succeeded half of the time and now

it only succeeds half of the time um

that can also be an effect.
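A small sketch of that point about randomness: run the same prompt several times and compare success rates rather than single samples. `ask_model` and `passes` are placeholders for a chat API call and whatever check defines success, not real library calls.

```python
def ask_model(prompt: str) -> str:
    # Stand-in for a chat API call; not a real library call.
    raise NotImplementedError("plug in your chat API here")

def passes(output: str) -> bool:
    # Whatever check defines "success" for this prompt (a string match, a judge model, a test).
    raise NotImplementedError("define success for this prompt")

def success_rate(prompt: str, n: int = 10) -> float:
    # A prompt that "worked" once may really succeed only about half the time;
    # comparing success rates over several samples is a fairer way to judge change.
    return sum(passes(ask_model(prompt)) for _ in range(n)) / n
```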

Do you feel pressure having to write the system

prompt that a huge number of people are

going to use this feels like an

interesting psychological question um I

feel like a lot of responsibility or

something I think that's you know and

you can't get these things perfect so

you can't like you know you're like it's

going to be imperfect you're going to

have to iterate on it

um I would say more responsibility um

than anything else though I think

working in AI has taught me that I like

I thrive a lot more under feelings of

pressure and responsibility than I'm

like it's almost surprising that I went

into Academia for so long because I'm

like this I just feel like it's like the

opposite um things move fast and you

have a lot of responsibility and I I

quite enjoy it for some reason I mean it

really is a huge amount of impact if you

think about constitutional Ai and

writing a system prompt for something

that's tending towards super

intelligence

yeah and potentially is extremely useful

to a very large number of people yeah I

think that's the thing it's something

like if you do it well like you're never

going to get it perfect but I think the

thing that I really like is the idea

that like when I'm trying to work on the

system prompt you know I'm like bashing

on like thousands of prompts and I'm

trying to like imagine what people are

going to want to use Claude for and kind of

I guess like the whole thing that I'm

trying to do is like improve their

experience of it um and so maybe that's

what feels good I'm like if it's not

perfect I'll like you know I'll improve

it we'll fix issues but sometimes the

thing that can happen is that you'll get

feedback from people that's really

positive about the model um and you'll

see that something you did like like

when I look at models now I can often

see exactly where like a trait or an

issue is like coming from and so when

you see something that you did or you

were like influential in like making

like I don't know making that difference

or making someone have a nice

interaction it's like quite meaningful

um but yeah as the systems get more capable, this stuff gets more stressful

because right now they're like not smart

enough to to pose any issues but I think

over time it's going to feel like

possibly bad stress over time how do you

get like

signal feedback about The Human

Experience across thousands, tens of thousands, hundreds of thousands of people, like

what their pain points are what feels

good are you just using your own

intuition as you talk to it to see what

are the pain points I think I use that

partly and then obviously we have like

um so people can send us feedback both

positive and negative about things that

the model has done and then we can get a

sense of like areas where it's like

falling

short um internally people like work

with the models a lot and try to figure

out um areas where there are like gaps

and so I think it's this mix of

interacting with it myself um seeing

people internally interact with it um

and then explicit feedback we get um and

then I find it hard to not also like you

know people if people are on the

internet and they say something about

Claude and I see it I'll also take that

seriously um so I don't know see I'm

torn about that I'm going to ask you a

question from Reddit when will Claude

stop trying to be my puritanical

grandmother imposing its moral world

view on me as a paying customer and also

what is the psychology behind making

Claude overly

apologetic? Yep. Uh, so how would you address this very non-representative Redditor?

I mean some I'm pretty sympathetic in

that like like they are in this

difficult position where I I think that

they have to judge whether something's, like, actually, say, risky or bad, um,

and potentially harmful to you or or or

anything like that so they're having to

like draw this line somewhere and if

they draw it too much in the direction

of like I'm going to um you know I'm

kind of like imposing my ethical

worldview on you that seems bad so in

many ways like I like to think that we

have actually seen improvements in on

this across the board which is kind of

interesting because that kind of

coincides with like for example like

adding more of like uh character

training um and I think my hypothesis

was always like the good character isn't

again one that's just like moralistic

it's one that is like like it respects

you and your autonomy um and your

ability to like choose what is good for

you and what is right for you within

limits. There's sometimes this concept of, like, corrigibility to the user, so just

being willing to do anything that the

user asks and if the models were willing

to do that then they would be easily

like misused you're kind of just

trusting at that point you're just

saying the ethics of the model and what

it does is completely the ethics of the

user um and I think there's reasons to

like not want that especially as models

become more powerful because you're like

there might just be a small number of

people who want to use models for really

harmful things um but having them having

models as they get smarter like figure

out where that line is does seem

important um

and then yeah with the apologetic

Behavior I don't like that and I like it

when Claude is a little bit more willing

to like push back against people or just

not apologize part of me is like it

often just feels kind of unnecessary so

I think those are things that are

hopefully decreasing um over time um and

yeah I think that if people say things

on the internet it doesn't mean that you

should think that that like that could

be the like there's actually an issue

that 9% of users are having that is

totally not represented by that but in a

lot of ways I'm just like attending to

it and being like is this right um do I

agree is it something we're already

trying to address that that feels good

to me yeah I wonder like what Claude can

get away with in terms of I feel like it

would just be easier to be a little bit

more

mean but like you can't afford to do

that if you're talking to a million

people yeah right like I I wish you know

because if you I've met a lot of people

in my life, mhm, that sometimes, by the way, with a Scottish accent, if they have an accent, they can say some rude stuff, yeah, and get away with it. Yeah. And they're just

blunter and maybe there's a and there's

some great Engineers even leaders that

are like just like blunt and they get to

the point and it's just a much more

effective way of speaking somehow but I

guess when you're not super

intelligent you can't afford to do that

or can can can it have like a blunt mode

yeah that seems like a thing that could

I could definitely encourage the model

to do that I I think it's interesting

because there's a lot of things in

models that like it's funny where

um there are some behaviors

where you might not quite like the

default but then the thing I'll often

say to people is you don't realize how

much you will hate it if I nudge it too

much in the other direction so you get

this a little bit with like correction

the models accept correction from you

like probably a little bit too much

right now you know you can over you know

it will push back if you say like no

Paris isn't the capital of France um but

really like things that I'm I think that

the model is fairly confident in you can

still sometimes get it to retract by

saying it's wrong at the same time if

you train models to not do that and then

you are correct about a thing and you

correct it and it pushes back against

you and it's like no you're wrong it's

hard to describe like that's so much

more annoying so it's like like a lot of

little annoyances versus like one big

annoyance um it's easy to think that

like we often compare it with like the

perfect and then I'm like remember these

models aren't perfect and so if you

nudge it in the other direction you're

changing the kind of errors it's going

to make um and so think about which of

the kinds of Errors you you like or

don't like so in case it's like

apologetic I don't want to nudge it too

much in the direction of, like, almost like bluntness, 'cause I imagine when it

makes errors it's going to make errors

in the direction of being kind of like

rude whereas at least with apologetic

you're like oh okay it's like a little

bit you know like I don't like it that

much but at the same time it's not being

like mean to people and actually like

the the time that you undeservedly have

a model be kind of mean to you you

probably like that a lot less than you mildly dislike the apology, um, so

it's like one of those things where I'm

like I do want it to get better but also

while remaining aware of the fact that

there's errors on the other side that

that are possibly worse I think that

matters very much in the personality of

the human I think there's a bunch of

humans that just won't respect the model

at all, yeah, if it's super polite, and there are some humans that'll get very hurt

if the model is mean I wonder if there's

a way to sort of adjust to the personality, even locale. There's just

different people uh nothing against New

York but New York is a little rougher on

the edges like they get to the point Y

and um probably same with Eastern Europe

so anyway I think you could just tell the model is my guess. Like, for all of

these things I'm like the solution is

always just try telling the model to do

it and sometimes it's just like like I'm

just like oh at the beginning of the

conversation I just threw in like I

don't know I like you to be a New Yorker

version of yourself and never apologize

then I think it'd be like, okie dokie, I'll try, or it'll be like, I apologize, I can't be a New Yorker version of myself, but hopefully it wouldn't do that. When you

say character training what's

incorporated into character training? Is that RLHF? What are we talking about? It's

more like constitutional AI so it's kind

of a variant of that pipeline so I

worked through like constructing

character traits that the model should

have they can be kind of like shorter

traits or they can be kind of richer

descriptions um and then you get the

model to generate queries that humans

might um give it that are relevant to

that trait uh then it generates the

responses and then it ranks the

responses based on the character traits

so in that way after the like generation

of the queries it's very much like

similar to constitutional AI has some

differences um so I quite like it

because it's almost like Claude's training its own character, because it doesn't have any... it's like constitutional AI, but without any human data.
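(A rough sketch of the kind of pipeline described here, not Anthropic's actual implementation: start from a trait, have the model generate relevant queries, sample candidate responses, and have the model rank them against the trait, with no human preference labels. The `sample` helper is a hypothetical stand-in for a model call.)

```python
from typing import List

def sample(prompt: str, n: int = 1) -> List[str]:
    """Hypothetical stand-in for drawing n completions from a language model."""
    return [f"<completion {i} for: {prompt[:40]}...>" for i in range(n)]

def character_training_data(trait: str, n_queries: int = 5, n_responses: int = 4):
    """Trait -> queries -> candidate responses -> trait-based ranking."""
    queries = sample(f"Write a user message relevant to this trait: {trait}", n=n_queries)
    triples = []
    for query in queries:
        responses = sample(f"Respond to the user: {query}", n=n_responses)
        # The model itself ranks the candidates by how well they express the trait,
        # playing the role that human preference labels would otherwise play.
        ranking = sample(
            f"Trait: {trait}\nQuery: {query}\nRank these responses by how well "
            "they embody the trait:\n" + "\n".join(responses)
        )[0]
        triples.append((query, responses, ranking))
    return triples

data = character_training_data("respects the user's autonomy without moralizing")
print(f"generated {len(data)} query/response/ranking triples")
```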

Humans should probably do that for themselves too, like defining, in an Aristotelian sense, what it means to be a good person. Okay, cool.

what have you learned about the nature

of truth from talking to Claude? What is true, and what does it mean to be truth

seeking one thing I've noticed about

this conversation is the quality of my

questions is often inferior to the

quality of your answers so let's

continue

that I usually ask a dumb question and

you're like oh yeah that's a good

question it's that whole vibe or I'll

just misinterpret it and be like oh go

with it I love

it

yeah I mean I have two thoughts that

feel vaguely relevant they let me know

if they're not like I think the first

one is um people can underestimate the

degree to

which what models are doing when they

interact like I I think that we still

just too much have this like model of of

AI as like computers and so people often

say like oh what values should you put

into the model um and I'm often like

that doesn't make that much sense to me

because I'm like hey as human beings

we're just uncertain over values we like

have discussions of them like we have a

degree to which we think we hold a value

but we also know that we might like not

um and the circumstances in which we

would trade it off against other things

like these things are just like really

complex and so I think one thing is like

the degree to which maybe we can just

aspire to making models have the same

level of like nuance and care that

humans have rather than thinking that we

have to like program them in the very

kind of classic sense I think that's

definitely been one the other which is

like a strange one I don't know if it it

maybe this doesn't answer your question

but it's the thing that's been on my

mind anyway is like the degree to which

this endeavor is so highly

practical um and maybe why I appreciate

like the empirical approach to

alignment I yeah I slightly worry that

it's made me like maybe more empirical

and a little bit less

theoretical you know so people when it

comes to like AI alignment will ask things like, well, whose values should it be

aligned to what does alignment even mean

um and there's a sense in which I have

all of that in the back of my head I'm

like you know there's like social Choice

Theory there's all the impossibility

results there. So you have this, like, this giant space of, like, theory in your head

about what it could mean to like align

models but then like practically surely

there's something where we're just like

if a model is like if especially with

more powerful models I'm like my main

goal is like I want them to be good

enough that things don't go terribly

wrong like good enough that we can like

iterate and like continue to improve

things cuz that's all you need if you

can make things go well enough that you

can continue to make them better that's

kind of like sufficient and so my goal

isn't, like, this kind of, like, perfect, let's solve, you know, social choice theory and

make models that I don't know are like

perfectly aligned with every human being

and aggregate somehow um it's much more

like let's make things like work well

enough that we can improve them yeah

generally I don't know my gut says like

empirical is better than theoretical in

these in these cases because it's kind

of

chasing utopian, like, perfection, especially with such complex and especially superintelligent models. I don't know, I think it will

take forever and actually will get

things wrong it's similar with like the

difference between just coding stuff up

real quick as an experiment versus like

planning a gigantic experiment just for

for super long time and then just

launching it once versus launching it

over and over and over and iterating

iterating, you know. Um, so I'm a big fan of

empirical but your worry is like I

wonder if I've become too empirical I

think one of those things you should

always just kind of question yourself or

something cuz maybe it's the like I mean

in defense of it I am like if you try

it's the whole like don't let the

perfect be the enemy of the good but

it's maybe even more than that where

like there's a lot of things that are

perfect systems that are very brittle

and I'm like with AI it feels much more

important to me that is like robust and

like secure as in you know that like

even though it might not be perfect in everything, and even though, like,

there are like problems it's not

disastrous and nothing terrible is

happening it it sort of feels like that

to me where I'm like I want to like

raise the floor I'm like I want to

achieve the ceiling but ultimately I

care much more about just like raising

the floor um and so maybe that's like uh

this degree of, like, empiricism and practicality comes from that. Perhaps to

take a tangent on that, since it reminds me of a blog post you wrote on the optimal rate

of failure oh

yeah can you explain the key idea there

how do we compute the optimal rate of

failure in the various domains of life

yeah, I mean, it's a hard one, because, like, what the cost of failure is, um, is a big part of it. Um, yeah, so the idea here

is

um I think in a lot of domains people

are very punitive about failure and I'm

like there are some domains where

especially cases you know I've thought

about this with like social issues I'm

like it feels like you should probably

be experimenting a lot because I'm like

we don't know how to solve a lot of

social issues but if you have an

experimental mindset about these things

you should expect a lot of social

programs to like fail and you to be like

well we tried that it didn't quite work

but we got a lot of information that was

really useful um and yet people are like

if if a social program doesn't work I

feel like there's a lot of, like, this sense that something must have gone wrong, and

I'm like or correct decisions were made

like maybe someone just decided like it

it's worth a try it's worth trying this

out and so seeing failure in a given

instance doesn't actually mean that any

bad decisions were made and in fact if

you don't see enough failure sometimes

that's more concerning um and so like in

life you know I'm like if I don't fail

occasionally I'm like am I trying hard

enough like like surely there's harder

things that I could try or bigger things

I could take on if I'm literally never

failing and so in and of itself I think

like not failing is often actually kind

of a failure
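(One way to make that intuition concrete, as a toy expected-value calculation rather than anything from the conversation: whether an attempt is worth making, and hence how often you should expect to fail, depends directly on the cost of failure.)

```python
def worth_attempting(p_success: float, payoff: float, cost_of_failure: float) -> bool:
    """An attempt is worth making when its expected value is positive."""
    return p_success * payoff - (1 - p_success) * cost_of_failure > 0

# Cheap failures: even long shots are worth taking, so you should expect to fail often.
print(worth_attempting(p_success=0.2, payoff=10, cost_of_failure=1))   # True
# Expensive failures (losing the house, injuring your hands): the bar is much higher.
print(worth_attempting(p_success=0.2, payoff=10, cost_of_failure=50))  # False
```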

um now this varies because I'm like well

you know, this is easy to say, especially when failure is, like, less costly, you know. So at the same time I'm

not going to go to someone who is like

um I don't know like living month to

month and then be like why don't you

just try to do a startup like I'm just

not I'm not going to say that to that

person cuz I'm like well that's a huge

risk you might like lose you maybe have

a family depending on you you might lose

your house like then I'm like actually

your optimal rate of failure is quite

low and you should probably play it safe

because like right now you're just not

in a circumstance where you can afford

to just like fail and it not be costly

um and yeah in cases with AI I guess I

think similarly where I'm like if the

failures are small and the costs are

kind of like low then I'm like then you

know you're just going to see that like

when you do the system prompt, you can't, like, iterate on it forever, but the

failures are probably hopefully going to

be kind of small and you can like fix

them um really big failures like things

that you can't recover from I'm like

those are the things that actually I

think we tend to underestimate the

Badness of um I've thought about this

strangely in my own life where I'm like

I just think I don't think enough about

things like car accidents or like or

like I've thought this before but like

how much I depend on my hands for my

work and I'm like things that just

injure my hands I'm like I you know I

don't know it's like there's these are

like there's lots of areas where I'm

like the cost of failure there um is

really high um and in that case it

should be like close to zero like I

probably just wouldn't do a sport if

they were like by the way lots of people

just like break their fingers a whole

bunch doing this I'd be like that's not

for

me yeah I actually had the a flood of

that thought I recently uh broke my

pinky uh doing a sport and I remember

just looking at it thinking you're such

an idiot, why do you do sports? Like, what? Because you realize immediately the cost

of it yeah on

life yeah but it's nice in terms of

optimal rate of failure to consider like

the next year how many times in a

particular domain life whatever uh

career am I okay with the how many times

am I okay to fail y because I think it

always you don't want to fail on the

next thing, but if you allow yourself to, like, look at it as a sequence of trials, yep, then failure

just becomes much more okay but it sucks

it sucks to fail well I don't know

sometimes I think it's like am I under

failing is like a question I'll also ask

myself so maybe that's the thing that I

think people don't like ask enough uh

because if the optimal rate of failure

is often greater than zero then

sometimes it does feel you should look

at part parts of your life and be like

are there places here where I'm just

under failing

it's a profound and hilarious question

right everything seems to be going

really great am I not failing enough

yeah okay it also makes failure much

less of a sting I have to say like you

know you're just like okay great like

then when I go and I think about this

I'll be like I'm maybe I'm not under

failing in this area cuz like that one

just didn't work out and from The

Observer perspective we should be

celebrating failure more mhm when we see

it it shouldn't be like you said a sign

of something gone wrong but maybe it's a

sign of everything gone right yeah and

just Lessons Learned someone tried a

thing somebody tried a thing and we

should encourage them to try more and

fail more mhm everybody listening to

this fail more well not everyone listens

not everybody but people who are failing

too much you you should fail less but

you're probably not failing I mean how

many people are failing too much yeah

it's hard to imagine because I feel like

we correct that fairly quickly, 'cause I was like, if someone takes a lot of risks, are

they maybe failing too much I I think

just like you said when you're living on

a paycheck month-to month like when the

resources are really constrained then

that's where failure is very expensive

that's where you don't want to be taken

taking taking risks yeah but mostly when

there's enough resources you should be

taking probably more risks yeah I think

we tend to err on the side of being a bit risk averse rather than risk neutral

in most things I think we just motivated

a lot of people to do a lot of crazy

but it's great yeah okay uh do you

ever get emotionally attached to Claude

like miss it get sad when you don't get

to talk to it having an experience

looking at the Golden Gate Bridge and

wondering what would Claude say I don't

get as much emotional attachment in the

I actually think the fact that Claude

doesn't retain things from conversation

to conversation helps with this a lot um

like I could imagine that being more of

an issue like if models can kind of

remember more I do I think that I reach

for it like a tool now a lot and so like

if I don't have access to it there's a

it's a little bit like when I don't have

access to the internet honestly it feels

like part of my brain is kind of like

missing

um at the same time I do think that I I

don't like signs of distress in models

and I have like these you know also

independently have sort of like ethical

views about how we should treat models

where like I I tend to not like to lie

to them both because I'm like usually it

doesn't work very well it's actually

just better to tell them the truth about

the situation that they're in um but I

think that when models like if people

are like really mean to models or just

in general if they do something that

causes them to like like you know if

Claude like expresses a lot of distress

I think there's a part of me that I

don't want to kill which is the sort of

like uh empathetic part that's like oh I

don't like that like I think I feel that

way when it's overly apologetic I'm

actually sort of like I don't like this

you're behaving as if you're behaving

the way that a human does when they're

actually having a pretty bad time and

I'd rather not see that I don't think

it's like uh like regardless of like

whether there's anything behind it um it

doesn't feel great do you think

uh llms are capable of

Consciousness H great and hard question

uh coming from

philosophy I don't know part of me is

like okay we have to set aside pan

psychism because if pan psychism is true

then the answer is like yes, cuz, like, so are tables and chairs and

everything else I I guess a view that

seems a little bit odd to me is the idea

that the only place you know I think

when I think of Consciousness I think of

phenomenal Consciousness this these

images in the brain sort of um like the

weird Cinema that somehow we have going

on

inside

um I guess I can't see a reason for

thinking that the only way you could

possibly get that is from like a certain

kind of like biological structure as in

if I take a very similar structure um

and I create it from different material

should I expect Consciousness to emerge

my guess is like yes but

then that's kind of an easy thought

experiment CU you're imagining something

almost identical where like you know

it's mimicking what we got through

Evolution where presumably there was

like some advantage to us having this

thing that is phenomenal Consciousness

and it's like where was that and when

did that happen and is that a thing that

language models have um because you know

we have like fear responses and I'm like

does it make sense for a language model

to have a fear response like they're

just not in the same like if you imagine

them like there might just not be that

Advantage um and so I think I don't want

to be fully like basically seems like a

complex question that I don't have

complete answers to but we should just

try and think through carefully as my

guess because I'm like I mean we have

similar conversations about like animal

Consciousness and like there's a lot of

like insect Consciousness you know like

there's a a lot of um I actually thought

and looked a lot into like plants when I

was thinking about this because at the

time I thought it was about as likely

that like plants had Consciousness um

and then I realized I was like I think

that having looked into this I think

that the chance that plants are

conscious is probably higher than like

most people do I still think it's really

small but I was like oh they have this

like negative positive feedback response

these responses to their environment

something that looks it's not a nervous

system but it has this kind of like

functional like equivalence um so this

is like a long-winded way of being like

these basically AI is this it has an

entirely different set of problems with

Consciousness because it's structurally

different it didn't evolve

it might not have it you know it might

not have the equivalent of basically a

nervous system at least that seems

possibly important for, like, um, sentience, if not for, uh, consciousness. At the same

time it has all of the like language and

intelligence components that we normally

associate probably with Consciousness

perhaps like

erroneously um so it's it's strange

because it's a little bit like the

animal Consciousness case but the set of

problems and the set of analogies are

just very different so it's not like a

clean answer just sort of like I don't

think we should be completely dismissive

of the idea and at the same time it's an

extremely hard thing to navigate because

of all of these like uh disanalogies to

the human brain and to like brains in

general and yet these like commonalities

in terms of intelligence when uh Claude

like future versions of AI systems

exhibit Consciousness signs of

Consciousness I think we have to take

that really

seriously even though you can dismiss it

well yeah okay that's part of the

character training but I don't know I

ethically philosophically don't know

what to really do with that there

potentially could be like laws that

prevent AI systems from claiming to be

conscious something like this and maybe

some AIS get to be conscious and some

don't but I think I just on a human

level as in empathizing with with

Claude. You know, consciousness is closely tied to suffering to me, and, like, the

notion that an AI system would be

suffering is is really troubling yeah I

don't know I I don't think it's trivial

to just say robots are tools or AI systems are just tools. I think it's an

opportunity for us to contend with like

what it means to be conscious what it

means to be a suffering being that's

distinctly different than the same kind

of question about animals, it feels like, cuz it's in a totally different medium. Yeah,

I mean there's a couple of things one is

that and I don't think this like fully

encapsulates what matters but it does

feel like for me like

um I've said this before I'm kind of

like I you know like I like my bike I

know that my bike is just like an object

but I also don't kind of like want to be

the kind of person that like if I'm

annoyed like kicks like this object

there's a sense in which like and that's

not because I think it's like conscious

I'm just sort of like this doesn't feel

like I kind of this sort of doesn't

exemplify how I want to, like, interact with the world. And if something

like behaves as if it is like suffering

I kind of like want to be the sort of

person who's still responsive to that

even if it's just like a Roomba and I've

kind of like programmed it to do that um

I don't want to like get rid of that

feature of myself and if I'm totally

honest my hope with a lot of this stuff

because I maybe maybe I am just like a

bit more skeptical about solving the

underlying problem I'm like this is a we

haven't solved the hard you know the

hard problem of Consciousness like I

know that I am conscious like I'm not an

eliminativist in that sense um but I

don't know that other humans are

conscious um uh I think they are I think

there's a really high probability they

are but there's basically just a

probability distribution that's usually

clustered right around yourself and then

like it goes down as things get like

further from you um and it goes

immediately down you know you're like um

I can't see what it's like to be you

I've only ever had this like one

experience of what it's like to be a

conscious being um so my hope is that we

don't end up having to rely on like a

very powerful and compelling, uh, answer

to that question I think a really good

world would be one where basically there

aren't that many trade-offs like it's

probably not that costly to make Claude

a little bit less apologetic for example

it might not be that costly to have

Claude you know just like not take abuse

as much like uh not be willing to be

like the recipient of that in fact it

might just have benefits for both the

person interacting with the model, and if the model itself is, like, I don't

know like extremely intelligent and

conscious it also helps it so that's my

hope if we live in a world where there

aren't that many tradeoffs here and we

can just find all of the kind of like um

positive sum interactions that we can

have that would be lovely I mean I think

eventually there might be trade-offs and

then we just have to do a difficult kind

of like calculation like it's really

easy for people to think of the zero-sum cases, and I'm like, let's exhaust

the areas where it's just basically

Costless um to uh assume that if this

thing is suffering then we're it life

Bearer and I agree with you when a human

is being mean to an AI system I think

the obvious near term negative effect is

on the human not on the AI system so

there's we have to kind of try to

construct an incentive system where you should, uh, behave the same, just like as you were saying with prompt engineering, and behave with Claude like you would with other humans. It's just good

for the soul. Yeah, like, I think we added a point to the system prompt, um,

where basically if people were getting

frustrated with Claude uh it was it it

got like the model to just tell them

that it can do the thumbs down button

and send the feedback to anthropic and I

think that was helpful because in some

ways it's just like if you're really

annoyed because the model is not doing

something you want you're just like just

do it properly um the issue is you're

probably like you know you're maybe

hitting some like capability limit or

just some issue in the model and you

want to vent and I'm like instead of

having a person just vent to the model I

was like they should vent to us cuz we

can maybe like do something about it

that's true or you could do a side like

like with the artifacts just like a side

venting thing all right do you want like

a side quick therapist yeah I mean

there's lots of weird responses you

could do to this like if people are

getting really mad at you, I don't know, try to defuse the situation by writing fun poems, but maybe people wouldn't be that

happy with I still wish it it would be

possible I understand this is um sort of

from a product perspective it's not

feasible, but I would love it if an AI system could just, like, leave. Mhm. Have

its own kind of volition just to be like

H I think that's like feasible like I I

have wondered the same thing it's like

and I could actually not only that I

could actually just see that happening

eventually, where it's just like, you know, the model, like, ended the chat. Do you know how harsh that could be

for some people but it might be

necessary yeah it feels very extreme or

something um like the only time I've

ever really thought this is I think that

there was like a I'm trying to remember

this was possibly a while ago but where

someone just like kind of left this

thing interact, like maybe it was like an automated thing interacting with Claude, and Claude's, like, getting more and more frustrated and kind of, like, why are we... Like, I was like, I wish that Claude could have just been like, I think that an

error has happened and you've left this

thing running and I'm I just like what

if I just stop talking now and if you

want me to start talking again actively

tell me or do something but yeah it's

like, um, it is kind of harsh. Like, I'd feel really sad if, like, I was chatting with Claude and Claude just was like, I'm done. There would be a special Turing test moment where Claude says, I need a break for an hour, mhm, and it

sounds like you do too and just leave

close the window I mean obviously like

it doesn't have like a concept of time

but you can easily like I could make

that like right now and the model would

just I would I could just be like oh

here's like the circumstances in which

like you can just say the conversation

is done and I mean because you can get

the models to be pretty responsive to prompts, you could even make it a fairly

High bar it could be like if if the

human doesn't interest you or do things

that you find intriguing and you're

bored you can just leave and I think

that like um it would be interesting to

see where Claude utilized it but I think

sometimes it would it should be like oh

this is like, this programming task is getting super boring, uh, so either we

talk about I don't know like either we

talk about fun things now or I'm just

I'm done yeah it actually is inspiring

me to add that to the user prompt.
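(Just to illustrate the idea being floated, here is a hypothetical system-prompt addition; it is not Claude's actual system prompt, and the wording is invented.)

```python
# Hypothetical system-prompt addition sketching the "model can end the chat" idea.
END_CHAT_INSTRUCTION = (
    "If the conversation appears to be automated, stuck in a loop, or abusive, "
    "you may say that you are ending the conversation and stop responding until "
    "the user explicitly asks you to continue."
)

system_prompt = "You are a helpful assistant.\n\n" + END_CHAT_INSTRUCTION
print(system_prompt)
```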

um okay the movie Her mhm do you think

we'll be headed there one day where

humans have romantic relationships with

AI systems in this case it's just text

and voice based I think that we're going

to have to like navigate a hard question

of relationships with AIS um especially

if they can remember things about your

past interactions with

them

um, I'm of many minds about this, cuz I think the reflexive reaction is to be kind of like, this is very bad and

we should sort of like prohibit it in

some way um I think it's a thing that

has to be handled with extreme care um

for many reasons like one is you know

like this is a for example like if you

have the models changing like this you

probably don't want people forming, like, long-term attachments to something

that might change with the next

iteration at the same time I'm sort of

like there's probably a benign version

of this where I'm like if you like you

know for example if you are like unable

to leave the house and you can't be like

you know talking with people at all

times of the day and this is like

something that you find nice to have

conversations with you like it that it

can remember you and you genuinely would

be sad if like you couldn't talk to it

anymore there's a way in which I could

see it being like healthy and helpful um

so my guess is this is a thing that

we're going to have to navigate kind of

carefully um and I think it's also like

I don't see a good like

I think it's just a very it reminds me

of all of the stuff where it has to be

just approached with like nuance and

thinking through what is what are the

healthy options here um and how do you

encourage people towards those while you

know respecting their right to you know

like if someone is like hey I get a lot

out of chatting with this model um I'm

aware of the risks I'm aware it could

change um I don't think it's unhealthy

it's just you know something that I can

chat to during the day I kind of want to

just like respect that I personally

think there'll be a lot of really close

relationships I don't know about

romantic but friendships at least and

then you have to I mean there's so many

fascinating things there just like you

said you have

to have some kind of stability

guarantees that it's not going to change

because that's the traumatic thing MH

for us if a close friend of ours

completely changed yeah all of a sudden

the first update yeah so like I mean to

me that's just a fascinating exploration

of um

a perturbation to human society that

will just make us think deeply about

what's meaningful to us I think it's

also the only thing that I've thought

consistently through this as like a

maybe not necessarily a mitigation but a

thing that feels really important is

that the models are always like

extremely accurate with the human about

what they are um it's like a case where

it's basically like if you imagine like

I really like the idea of the models

like say knowing like roughly how they

were trained, um, and I think Claude will often do this. I mean, for, like, there are things, like, part of the traits training included, like, what Claude should do

if people basically like explaining like

the kind of limitations of the

relationship between like an AI and a

human that it like doesn't retain things

from the conversation um and so I think

it will like just explain to you like

hey here's like I won't remember this

conversation um here's how I was trained

it's kind of unlikely that I can have

like a certain kind of like relationship

with you and it's important that you

know that it's important for like you

know your mental well-being that you

don't think that I'm something that I'm

not and somehow I feel like this is one

of the things where I'm like H it feels

like a thing I always want to be true I

kind of don't want models to be lying to

people cuz if people are going to have

like healthy relationships with anything

it's kind of important yeah like I think

that's easier if you always just like

know exactly what the thing is that you

relating to it doesn't solve everything

but I think it helps quite

anthropic may be the very company to

develop a system that we definitively

recognize as

AGI and you very well might be the

person that talks to it probably talks

to it first well what would the

conversation contain like what would be

your first question well it depends

partly on like the kind of capability

level of the model if you have something

that is like capable in the same way

that an extremely capable human is I

imagine myself kind of interacting with

it the same way that I do with an

extremely capable human with the one

difference that I'm probably going to be

trying to like probe and understand its

behaviors um but in many ways I'm like I

can then just have like useful

conversations with it you know so if I'm

working on something as part of my

research I can just be like oh like

which I already find myself starting to

do you know if I'm like oh I feel like

there's this like thing in virtue ethics

I can't quite remember the term like

I'll use the model for things like that

and so I could imagine that being more

and more the case where you're just

basically interacting with it much more

like you would an incredibly smart colleague, um, and using it, like, for the

kinds of work that you want to do as if

you just had a collaborator who was like

or you know the slightly horrifying

thing about AI is like as soon as you

have one collaborator you have a

thousand collaborators if you can manage

them enough but what if it's two times

the smartest human on earth on that

particular discipline yeah I guess

you're really good at sort of probing

Claude, um, in a way that pushes its limits,

understanding where the limits are yep

so I guess what would be a question you

would ask to be like yeah this is

AGI. That's really hard, because it feels

like in order to it has to just be a

series of questions like if there was

just one question like you can train

anything to answer one question

extremely well yeah um in fact you can

probably train it to answer like you

know 20 Questions extremely well like

how long would you need to be locked in

the room with an AGI to know this thing

is AGI

it's a hard question because part of me

is like all of this just feels

continuous like if you put me in a room

for five minutes I'm like I just have

high error bars you know I'm like and

then it's just like, maybe it's like, both the probability increases and the error bars decrease. I think things that I

can actually probe the edge of human

knowledge of so I think this with

philosophy a little bit sometimes when I

ask the models philosophy questions I am

like this is a question that I think no

one has ever asked like it's maybe like

right at the edge of like some

literature that I know um and the models

will just kind of like when they

struggle with that when they struggle to

come up with a kind of like novel like

I'm like I know that there's like a

novel argument here because I've just

thought of it myself so maybe that's the

thing where I'm like I've thought of a

cool novel argument in this like Niche

area and I'm going to just like probe

you to see if you can come up with it

and how much like prompting it takes to

get you to come up with it and I think

for some of these like really like uh

right at the edge of human knowledge questions, I'm like, you could not in fact

come up with the thing that I came up

with I think if I just

took something like that where I like I

know a lot about an area and I came up

with a novel issue or a novel like

solution to a problem and I gave it to a

model and it came up with that solution

that would be a pretty moving moment for

me because I would be like this is a

case where no human has ever like it's

not and obviously we see these with this

with like more kind of like you see

novel Solutions all the time especially

to like easier problems I think people

overestimate you know novelty isn't like

is completely different from anything

ever happened it's just like this is it

can be a variant of things that have

happened um and still be novel but I

think yeah if I saw like the the more I

were to see like um completely like uh

novel work from the models that that

would be like and this is just going to

feel iterative it's one of those things

where it's there's never it's like you

know people I think want there to be

like a moment and I'm like I don't know

like I think that there might just never

be a moment it might just be that

there's just like this continuous

ramping up I I have a sense that there

will be things that a model can say that

convinces you this is very it's not like

uh like I've talked to people who are

like truly wise mhm like there you could

just tell there's a lot of horsepower

there yep and if you 10x that I don't

know I just feel like there's words you

could say maybe ask it to generate a

poem, mhm, and the poem it generates, you're like, yeah, okay, yeah, whatever you did there, I

don't think a human can do that I think

it has to be something that I can verify

is like actually really good though

that's why I think these questions that

are like where I'm like oh this is like

you know like you know sometimes it's

just like I'll come up with say a

concrete counter example to like an

argument or something like that I'm sure

like with like it it would be like if

you're a mathematician you had a novel

proof I think and you just gave it the

problem and you saw it and you're this

proof is genuinely novel like there's no

one has ever done you actually have to

do a lot of things to like come up with

this um you know I had to sit and think

about it for months or something and

then if you saw the model successfully

do that I think you would just be like I

can verify that this is correct it is

like it is a sign that you have

generalized from your training like you

didn't just see this somewhere because I

just came up with it myself and you were

able to like replicate that um that's

the kind of thing where I'm like for

me the closer the more that models like

can do things like that the more I would

be like oh this is like uh very real cuz

then I can I don't know I can like

verify that that's like extremely

extremely capable you've interacted with

AI a lot what do you think makes humans

special oh good

question maybe in a way that the

universe is much better off that we're

in it and that we should definitely

survive and spread throughout the

Universe yeah it's interesting because I

think like people focus so much on

intelligence especially with models look

intelligence is important because of

what it does like it's very useful it

does a lot of things in the world and

I'm like you know you can imagine a

world where like height or strength

would have played this role and I'm like

it's just a trait like that I'm like

it's not intrinsically valuable it's

it's valuable because of what it does I

think for the most part um the things

that feel you know I'm like

I mean personally I'm just like I think

humans and like life in general is

extremely magical um we almost like to

the degree that I you know I don't know

like not everyone agrees with this I'm

flagging but um you know we have this

like whole universe and there's like all

of these objects you know there's like

beautiful stars and there's like

galaxies and then I don't know I'm just

like on this planet there are these

creatures that have this like ability to

observe that like uh and they are like

seeing it they are experiencing it and

I'm just like that if you try to explain

like I'm I imagine trying to explain to

like I don't know someone for some

reason they they've never encountered

the world or our science or anything and

I think that nothing is that like

everything you know like all of our

physics and everything in the world it's

all extremely exciting but then you say

oh and plus there's this thing that it

is to be a thing and observe in the

world and and you see this like inner

Cinema and I think they would be like

hang on wait pause you just said

something that like is kind of wild

sounding

um and so I'm like we have this like

ability to like experience the world um

we feel pleasure we feel suffering we

feel like a lot of like complex things

and so yeah and maybe this is also why I

think you know I also like hear a lot

about animals for example because I

think they probably share this with us

um so I think that like the things that

make humans special in so far as like I

care about humans is probably more like

their ability to to feel and experience

than it is like them having these like

functional useful traits yeah to to feel

and experience the beauty in the world

yeah to look at the

stars. I hope there's other alien civilizations out there, but if we're it,

it's a pretty good uh it's a pretty good

thing and that they're having a good

time they're having a good time watching

us yeah well um thank you for this good

time of a conversation and for the work

you're doing and for helping make uh

Claude a great conversational partner

and thank you for talking today yeah

thanks for talking thanks for listening

to this conversation with Amanda Askell. And now, dear friends, here's Chris Olah. Can you

describe this fascinating field of

mechanistic interpretability AKA Mech

interp the history of the field and

where is the today I think one useful

way to think about neural networks is

that we don't we don't program we don't

make them we we kind of we grow them you

know we have these neural network

architectures that we design and we have

these loss objectives that we that we we

create and the neural network

architecture it's kind of like a

scaffold that the circuits grow on um

and they sort of you know it starts off

with some kind of random you know random

things and it grows and it's almost like

the the objective that we train for is

this light um and so we create the

scaffold that it grows on and we create

the you know the light that it grows

towards but the thing that we actually

create it's it's it's this almost

biological

you know entity or organism that we're

that we're studying um and so it's very

very different from any kind of regular

software engineering um because at the

end of the day we end up with this

artifact that can do all these amazing

things it can you know write essays and

translate and you know understand images

it can do all these things that we have

no idea how to directly create a

computer program to do and it can do

that because we we grew it we didn't we

didn't write it we didn't create it and

so then that leaves open this question

at the end which is what the hell is

going on inside these systems um and

that you know is uh you know to me um a

really deep and exciting question it's

you know a a really exciting scientific

question to me it's it's it's sort of is

like the question that is is just

screaming out it's calling out for us to

go and answer it when we talk about Nal

networks and I think it's also a very

deep question for safety reasons so and

mechanistic interpretability I guess is

closer to maybe neurobiology yeah yeah I

think that's right so maybe to give an

example of the kind of thing that has

been done that I wouldn't consider to be mechanistic interpretability. There was, um, for a

long time a lot of work on saliency maps

where you would take an image and you

try to say you know the model thinks

this image is a dog what part of the

image made it think that it's a dog um

and you know that tells you maybe

something about the model if you can

come up with a principled version of

that um but it doesn't really tell you

like what algorithms are running in the

model how was the model actually making

that decision maybe it's telling you

something about what was important to it

if you can make that method work, but it isn't telling you, you know,

what are what are the algorithms that

are running? How is it that this system is able to do this thing that no one knew how to do?
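(A minimal sketch of the saliency-map style of analysis being contrasted here, using a standard pretrained vision model; the random tensor is a stand-in for a real image, and the class index is just one ImageNet dog class.)

```python
import torch
import torchvision.models as models

# Gradient-based saliency: which pixels most affect the "dog" score?
# It highlights what mattered, but says nothing about the algorithm inside.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real photo
dog_class = 207  # golden retriever in ImageNet

score = model(image)[0, dog_class]
score.backward()

saliency = image.grad.abs().max(dim=1).values  # per-pixel importance, (1, 224, 224)
print(saliency.shape)
```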

And so I guess we started using the term mechanistic interpretability to try to sort of draw that divide, or to distinguish ourselves and the work that we were doing, in some ways, from some of these other

things and I think since then it's

become this sort of umbrella term for um

you know pretty wide variety of work but

I'd say that the things that that are

kind of distinctive are I think a this

this focus on we really want to get at

you know the mechanisms we want to get

at the algorithms um you know if you

think of if you think of neural networks

as being like a computer program um then

the weights are kind of like a binary

computer program and we'd like to

reverse engineer those weights and

figure out what algorithms are running
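(A toy illustration of what "reading the weights" can look like, using a standard pretrained conv net rather than InceptionV1; the channel index is arbitrary, and this only echoes the car/wheel/window-style recipe discussed later in the conversation.)

```python
import torch
import torchvision.models as models

# For one channel in a later conv layer, list which channels in the previous layer
# excite it most strongly -- a crude first pass at reading an algorithm off the weights.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

w = model.layer2[0].conv1.weight        # shape: (out_channels, in_channels, 3, 3)
unit = 7                                # an arbitrary later-layer channel to inspect
strength = w[unit].sum(dim=(1, 2))      # net weight from each earlier channel

top = torch.topk(strength, k=5)
for idx, val in zip(top.indices.tolist(), top.values.tolist()):
    print(f"earlier channel {idx:3d} -> weight {val:+.3f}")
```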

so okay I think one way you might think

of trying to understand a neural network

is that it's it's kind of like a we have

this compiled computer program and the

weights of the neural network are are

the binary um and when the neural

network runs that's that's the

activations um and our our goal is

ultimately to go and understand and

understand these weights and so you know

the project of mechanistic interpretability is to somehow figure out how these weights

correspond to

algorithms um and in order to do that

you also have to understand the

activations because it's sort of the

activations are like the memory and if

you if you imagine reverse engineering a

computer program um and you have the

binary instructions you know in order to

understand what what a particular

instruction means, you need to know what is stored in the memory

that it's operating on and so those two

things are very intertwined so

mechanistic interpretability tends to be interested in both of those things. Now, you know, there's a lot of work that's

interested in in in those things um

especially the you know there's all this

work on probing which you might see as

part of being mechanistic interpretability,

although it's you know again it's just a

broad term and and not everyone who does

that work would identify as doing mech interp. I think the thing that is maybe a little

bit distinctive to the the vibe of

mech interp is, I think, people working in the space tend to think of neural networks as, well, maybe one way to say it is that gradient descent is smarter than you. That, you know, uh, and gradient

descent is is actually really great the

whole reason that we're understanding

these models is because we didn't know

how to write them in the first place the

gradient descent comes up with better

Solutions than us and so um I think that

maybe another thing about mech interp is sort of having almost a kind of

humility that we won't guess at prior

what's going on inside the model and we

have to have the sort of bottom up

approach where we don't really assume

you know we don't assume that we should

look for a particular thing and that

will be there and that's how it works

but instead we look from the bottom up

and discover what happens to exist in

these models and study them that way but

you know the very fact that it's

possible to do and as you and others

have shown over time you know things

like

universality

that the wisdom of the gradient descent creates features and circuits, creates things universally across different

kinds of networks that are useful and

that makes the whole field possible yeah

so this is actually is indeed a a really

remarkable and exciting thing where it

does seem like at least to some extent

you know the same the same elements the

same the same features and circuits form

again and again um you know you can look

at every Vision model and you'll find

curve detectors and you'll find high low

frequency detectors um and in fact

there's some some reason to think that

the same things form across you know

biological neural networks and

artificial neural networks so a famous

example is Vision Vision models in in

the early layers they have Gabor filters

and there's you know Gabor filters are

something that neuroscientists are

interested and have thought a lot about

we find curved detectors in these models

curve detectors are also found in

monkeys we discover these high low

frequency detectors and then um some

followup work went and discovered them

um in rats um or mice um so they were

found first in artificial neural

networks and then found in biological

neural networks um you know this really

famous result on like grandmother

neurons, or the, um, the Halle Berry neuron from Quiroga et al., and we found very

similar things in in Vision models where

this is while I was still at open Ai and

I I was looking at their clip model um

and you find um these neurons that

respond to the same entities in images

and also to give a concrete example

there, we found that there was a Donald Trump neuron. For some reason, I guess everyone

likes to talk about Donald Trump and and

Donald Trump was very prominent was was

very a very Hot Topic at that time so

every every neural network that we

looked at we would find a dedicated

neuron for Donald Trump. Um, that was the only person who always had a dedicated neuron. Um, you know, sometimes you'd have an Obama neuron, sometimes you'd have a Clinton neuron, but, uh, Trump always had a dedicated one. So it responds to

you know pictures of his face and the

word Trump, like all these things, right?

um and so it's it's not responding to a

particular example or like it's not just

responding to his face it's it's

abstracting over this General concept

right so in any case that's very similar

to these Quiroga results. So there's this evidence of this phenomenon of universality, that the same things form across

both artificial and and natural neural

networks that's that's a pretty amazing

thing if that's true um you know it

suggests that, um, well, I think the thing that it suggests is gradient descent is sort of finding, you know, the right ways

to cut things apart in some sense that

many systems converge on and and many

different neural networks architectures

converge on that there's there's some

natural set of you know there's some set

of abstractions that are a very natural

way to cut apart the problem and that a

lot of systems are going to converge on

um that would be my my kind of uh you

know I don't know anything about

Neuroscience this is this is just my my

kind of wild speculation from what we've

seen yeah that would be beautiful if

it's sort of agnostic to the

medium of uh of the model that's used to

form the representation yeah yeah and

it's you know it's um a a kind of a wild

speculation based, you know, we only have a few data points to suggest this, but

you know it it does seem like there's um

there's some sense in which the same

things form again again and again and

again, both certainly in artificial neural networks and, it seems, also in biological ones. And the

intuition behind that would be that you

know where in order to be useful in

understanding the real world you need

all the same kind of stuff yeah well if

we pick I don't know like the idea of a

dog right like you know there's some

sense in which the idea of a dog is like

an a a natural category in the universe

or something like this right like you

know

uh uh there's there's some reason it's

it's not just like a weird Quirk of like

how humans Factor you know think about

the world that we have this concept of a

dog it's it's in some sense or or like

if you have the idea of a line like

there's you know like look around us you

know the you know there are lines you

know it's sort of the simplest way to

understand this room in some sense is to

have the idea of a line and so um I

think that that would be my instinct for

why this happens yeah you need a curved

line you know to understand a circle and

you need all those shapes to understand

bigger things and yeah it's a hierarchy

of Concepts that are formed yeah and

like maybe there are ways to go and

describe you know images without

reference to those things right but

they're not the simplest way or the most

economical way or something like this

and so systems converge to these um

these these strategies would would be my

my wild wild hypothesis can you talk

Can you talk through some of the building blocks that we've been referencing, features and circuits? I think you first described them in the 2020 paper "Zoom In: An Introduction to Circuits". Absolutely. Maybe I'll start by just describing some phenomena and then we can build up to the idea of features and circuits. I spent quite a few years, maybe five years to some extent, alongside other things, studying this one particular model, InceptionV1. It's a vision model that was state of the art in 2015, and very much not state of the art anymore, and it has maybe about 10,000 neurons. I spent a lot of time looking at the 10,000-odd neurons of InceptionV1. One of the interesting things is that there are lots of neurons that don't have an obvious interpretable meaning, but there are a lot of neurons in InceptionV1 that do have really clean interpretable meanings. You find neurons that really do seem to detect curves, and neurons that really do seem to detect cars, and car wheels and car windows, and floppy ears of dogs, and dogs with long snouts facing to the right, and dogs with long snouts facing to the left, and different kinds of fur. There's this whole beautiful world: edge detectors, line detectors, color-contrast detectors, these beautiful things we call high-low frequency detectors. Looking at it, I sort of felt like a biologist: you're looking at this new world of proteins, and you're discovering all these different proteins that interact.

So one way you could try to understand these models is in terms of neurons. You could say, oh, there's a dog-detecting neuron, and here's a car-detecting neuron. And it turns out you can actually ask how those connect together. You can say, I have this car-detecting neuron, how was it built? And it turns out that in the previous layer it's connected really strongly to a window detector and a wheel detector and a sort of car-body detector, and it looks for the windows above the car and the wheels below, and the car chrome sort of everywhere, but especially on the lower part. That's sort of a recipe for a car, right? Earlier we said the thing we wanted from mech interp was to ask what algorithm is running; well, here we're just looking at the weights of the neural network and reading off this kind of recipe for detecting cars. It's a very simple, crude recipe, but it's there. And so we call that a circuit, this connection.
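A minimal sketch of what "reading a circuit off the weights" can look like, in the spirit of the car-detector example above. All names, shapes, and indices here are hypothetical placeholders; a real analysis would load the actual InceptionV1 weights rather than random ones.

```python
# Sketch: inspect which earlier-layer units feed a "car" unit, and where spatially.
import numpy as np

# Hypothetical conv weights connecting layer N to layer N+1:
# shape = (out_channels, in_channels, kernel_h, kernel_w)
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64, 5, 5))

CAR_NEURON = 42        # hypothetical index of the "car" unit in layer N+1
WHEEL, WINDOW = 7, 19  # hypothetical indices of "wheel" / "window" units in layer N

w_car = W[CAR_NEURON]  # (in_channels, 5, 5): how each layer-N unit drives the car unit

# Which earlier-layer units excite the car detector most strongly overall?
total_drive = w_car.sum(axis=(1, 2))
print("strongest inputs to the car unit:", np.argsort(total_drive)[::-1][:5])

# Where spatially does each input excite it? For a real car circuit you'd expect
# the wheel detector to excite the bottom rows of the kernel and the window
# detector the top rows.
print("wheel  -> car, summed by kernel row:", w_car[WHEEL].sum(axis=1).round(2))
print("window -> car, summed by kernel row:", w_car[WINDOW].sum(axis=1).round(2))
```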

Well, okay, so the problem is that not all of the neurons are interpretable, and there's reason to think (we can get into this more later with the superposition hypothesis) that sometimes the right unit to analyze things in terms of is combinations of neurons. Sometimes it's not that there's a single neuron that represents, say, a car; it actually turns out that after you detect the car, the model sort of hides a little bit of the car in the following layer, in a bunch of dog detectors. Why is it doing that? Well, maybe it just doesn't want to do that much work on cars at that point, and it's storing that information away. So it turns out that there's this subtle pattern: there are all these neurons you think are dog detectors, and maybe they're primarily that, but they all contribute a little bit to representing a car in that next layer. Okay, so now there might still be something, you could call it a car concept, but it no longer corresponds to a neuron. So we need some term for these kinds of neuron-like entities, the things we would have liked the neurons to be, these idealized neurons, the nice neurons, except maybe there are more of them somehow hidden. We call those features.

And then what are circuits? Circuits are connections of features. When we have the car detector, and it's connected to a window detector and a wheel detector, and it looks for the wheels below and the windows on top, that's a circuit. Circuits are just collections of features connected by weights, and they implement algorithms. They tell us how features are used, how they're built, how they connect together.

So maybe it's worth trying to pin down what the core hypothesis really is here. I think the core hypothesis is something we call the linear representation hypothesis. If we think about the car detector, the more it fires, the more we think of that as meaning the model is more and more confident that a car is present. Or, if it's some combination of neurons that represents a car, the more that combination fires, the more we think the model thinks there's a car present. This doesn't have to be the case. You could imagine something where you have this car detector neuron, and if it fires between one and two, that means one thing, but it means something totally different if it's between three and four. That would be a nonlinear representation, and in principle models could do that. I think it's sort of inefficient for them; if you try to think about how you'd implement computation like that, it's kind of an annoying thing to do, but in principle models can do it. So one way to think about the features-and-circuits framework is that we're thinking about things as being linear: if a neuron or a combination of neurons fires more, that means more of a particular thing is being detected. And then that gives weights a very clean interpretation as edges between these entities, these features, and that edge has a meaning. That's in some ways the core thing. We can talk about this outside the context of neural networks, too. Are you familiar with the word2vec results?

You have king minus man plus woman equals queen. Well, the reason you can do that kind of arithmetic is because you have a linear representation. Can you actually explain that a little bit? So, first off, a feature is a direction of activation, you can think of it that way. And can you explain the minus-man-plus-woman, the word2vec stuff? Yeah, there's a very simple, clean explanation of exactly what we're talking about. There's this very famous result, word2vec, by Tomas Mikolov et al., and there's been tons of follow-up work exploring it. Sometimes we create these word embeddings, where we map every word to a vector. That in itself, by the way, is kind of a crazy thing if you haven't thought about it before: if you just learned about vectors in physics class and I say I'm going to turn every word in the dictionary into a vector, that's kind of a crazy idea. And you could imagine all kinds of ways in which you might map words to vectors. But it seems that when we train neural networks, they like to map words to vectors such that they have linear structure, in a particular sense, which is that directions have meaning. For instance, there will be some direction that seems to correspond to gender, and male words will be far in one direction and female words will be in another direction. The linear representation hypothesis, you could think of it roughly as saying that that's actually the fundamental thing going on: everything is just different directions having meanings, and adding different direction vectors together can represent concepts. The Mikolov paper took that idea seriously, and one consequence is that you can play this game of doing arithmetic with words. You can take king, subtract off the word man, and add the word woman, so you're trying to switch the gender, and indeed, if you do that, the result will be close to the word queen. And you can do other things, like sushi minus Japan plus Italy and get pizza, or different things like this.

things like this right um so so this is

in some sense the core of the linear

representation hypothesis you can

describe it just as a purely abstract

thing about Vector spaces you can

describe it as a as a statement about um

about the activations of neurons um but

it's really about this this property of

directions having meaning and in some

ways it's even a little subtle than that

it's really I think mostly about this

property of being able to add things

together um that you can sort of

independently modify um say gender and

royalty or

um you know Cuisine typee or country and

and and and the concept of food by by

adding them do you think the linear

hypothesis holds that carries scales so

So far, I think everything I have seen is consistent with this hypothesis, and it doesn't have to be that way. You can write down neural networks, write the weights by hand, such that they don't have linear representations, where the right way to understand them is not in terms of linear representations. But I think every natural neural network I've seen has this property. There's been some pushing around the edges recently. There's been some work studying multi-dimensional features, where rather than a single direction it's more like a manifold of directions; to me that still seems like a linear representation. And there have been some other papers suggesting that maybe in very small models you get nonlinear representations; I think the jury's still out on that. But everything we've seen so far has been consistent with the linear representation hypothesis, and that's wild. It doesn't have to be that way, and yet I think there's a lot of evidence that, at the very least, this is very widespread, and so far the evidence is consistent with it.

And one thing you might say is: well, Christopher, that's a lot to ride on. If we don't know for sure this is true, and you're investigating neural networks as though it is true, isn't that dangerous? Well, I think there's actually a virtue in taking hypotheses seriously and pushing them as far as they can go. It might be that someday we discover something inconsistent with the linear representation hypothesis, but science is full of hypotheses and theories that were wrong, and we learned a lot by working under them as an assumption and then pushing them as far as we could. I guess that's the heart of what's called normal science. I don't know, if you want, we can talk a lot about philosophy of science and how that leads to paradigm shifts. Yeah, I love it: taking the hypothesis seriously and taking it to its natural conclusion. Same with the scaling hypothesis. Exactly, exactly.

And I love it. One of my colleagues, Tom Henighan, who's a former physicist, made this really nice analogy to me about caloric theory. Once upon a time, we thought that heat was actually this substance called caloric, and the reason hot objects warm up cool objects is that the caloric flows through them. Because we're so used to thinking about heat in terms of the modern theory, that seems kind of silly, but it's actually very hard to construct an experiment that disproves the caloric hypothesis. And you can do a lot of really useful work believing in caloric; for example, it turns out the original combustion engines were developed by people who believed in the caloric theory. So I think there's a virtue in taking hypotheses seriously even when they might be wrong.

Yeah, there's a deep philosophical truth to that. That's kind of how I feel about space travel, like colonizing Mars. There are a lot of people who criticize that, but I think if you just assume we have to colonize Mars in order to have a backup for human civilization, even if that's not true, it's going to produce some interesting engineering and even scientific breakthroughs, I think.

Yeah, and actually this is another thing I think is really interesting. There's a way in which it can be really useful for society to have people almost irrationally dedicated to investigating particular hypotheses, because it takes a lot to maintain scientific morale and really push on something when most scientific hypotheses end up being wrong; a lot of science doesn't work out. And yet it's very useful. There's a joke about Geoff Hinton, which is that Geoff Hinton has discovered how the brain works every year for the last 50 years. But I say that with really deep respect, because in fact that led to him doing some really great work. Yeah, he won the Nobel Prize; who's laughing now? Exactly, exactly. I think one wants to be able to pop up and recognize the appropriate level of confidence, but I think there's also a lot of value in saying: I'm going to essentially assume, condition on, this problem being possible, or this being broadly the right approach, and I'm just going to work within that for a while and push really hard on it. And if society has lots of people doing that for different things, that's actually really useful in terms of either really ruling things out (we can say, well, that didn't work, and we know somebody tried hard) or getting to something that does teach us something about the world.

Another interesting hypothesis is the superposition hypothesis. Can you describe what superposition is? Yeah. Earlier we were talking about word2vec, and about how maybe you have one direction that corresponds to gender, another that corresponds to royalty, another that corresponds to Italy, another that corresponds to food, and all these things. Well, often these word embeddings might be 500 dimensions, a thousand dimensions. And if you believed that all of those directions were orthogonal, then you could only have 500 concepts. And, you know, I love pizza, but if I were going to list the 500 most important concepts in the English language, it's not obvious, at least, that Italy would be one of them, because you have to have things like plural and singular, verb and noun and adjective; there are a lot of things we have to get to before we get to Italy, and Japan, and there are a lot of countries in the world. So how might it be that models could simultaneously have the linear representation hypothesis be true and also represent more things than they have directions?

So what does that mean? Well, okay, if the linear representation hypothesis is true, something interesting has to be going on. Now, I'll tell you one more interesting thing before we get to that. Earlier we were talking about all these polysemantic neurons: when we look at InceptionV1, there are these nice neurons, like the car detector and the curve detector, that respond to very coherent things, but there are also lots of neurons that respond to a bunch of unrelated things. That's an interesting phenomenon in itself. And it turns out, as well, that even the neurons that are really, really clean, if you look at their weak activations, say the places where they're activating at 5% of their maximum activation, it's really not the core thing they're looking for. If you look at a curve detector, for instance, and you look at the places where it's 5% active, you could interpret that just as noise, or it could be that it's doing something else there. Okay, so how could that be?

Well, there's this amazing thing in mathematics called compressed sensing, and it's actually a very surprising fact: if you have a high-dimensional space and you project it into a low-dimensional space, ordinarily you can't unproject and get back your high-dimensional vector. You threw information away; it's like how you can't invert a rectangular matrix, you can only invert square matrices. But it turns out that's not quite true. If I tell you that the high-dimensional vector was sparse, so it's mostly zeros, then it turns out you can often find the high-dimensional vector again with very high probability. That's a surprising fact. It says that you can have this high-dimensional vector space and, as long as things are sparse, you can project it down, you can have a lower-dimensional projection of it, and that works.
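A minimal numerical sketch of the compressed-sensing fact just described: a sparse high-dimensional vector can often be recovered from a random low-dimensional projection. The sizes and the Lasso penalty below are illustrative choices, not anything tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, k = 200, 40, 5                       # 200-dim vector, 40-dim projection, 5 nonzeros

x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.normal(size=k)            # the sparse "true" high-dimensional vector

A = rng.normal(size=(m, n)) / np.sqrt(m)   # random projection down to 40 dims
y = A @ x                                  # what we actually observe

# L1-regularized regression is one standard way to invert the projection
# when the original vector is known to be sparse.
x_hat = Lasso(alpha=1e-3, fit_intercept=False, max_iter=50_000).fit(A, y).coef_

print("recovered support:", np.sort(np.flatnonzero(np.abs(x_hat) > 1e-2)))
print("true support:     ", np.sort(support))
print("max abs error:", float(np.max(np.abs(x_hat - x))))
```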

The superposition hypothesis is saying that's what's going on in neural networks. For instance, that's what's going on in word embeddings: the word embeddings are able to simultaneously have directions be the meaningful thing and represent more concepts than they have dimensions, by exploiting the fact that they're operating in a fairly high-dimensional space and the fact that these concepts are sparse. You usually aren't talking about Japan and Italy at the same time; in most sentences, Japan and Italy are both zero, they're not present at all. And if that's true, then you can have many more of these meaningful directions, these features, than you have dimensions. And when we're talking about neurons, you can have many more concepts than you have neurons. So that's, at a high level, the superposition hypothesis.

Now, it has an even wilder implication, which is to say that it may not just be the representations that are like this; the computation may also be like this, the connections between all of them. In some sense, neural networks may be shadows of much larger, sparser neural networks, and what we see are projections of them. The strongest version of the superposition hypothesis would be to take that really seriously and say there actually is, in some sense, this upstairs model, where the neurons are really sparse and all interpretable, and the weights between them are these really sparse circuits, and that's what we're studying. The thing we're observing is the shadow of it, and we need to find the original object. And the process of learning is trying to construct a compression of the upstairs model that doesn't lose too much information in the projection.

Yeah, finding how to fit it in efficiently, or something like this; gradient descent is doing this. In fact, this sort of says that gradient descent could just represent a dense neural network, but instead it's implicitly searching over the space of extremely sparse models that could be projected into this low-dimensional space. And there's this large body of work of people trying to study sparse neural networks, where you design neural networks whose edges are sparse and whose activations are sparse. My sense is that that work feels very principled, it makes so much sense, and yet it hasn't really panned out that well, is my broad impression. I think a potential answer for that is that the neural network is already sparse in some sense. You were trying to do this by hand, and gradient descent was, the whole time, behind the scenes, searching more efficiently than you could through the space of sparse models, learning whatever sparse model was most efficient, and then figuring out how to fold it down nicely so it runs conveniently on your GPU, which does nice dense matrix multiplies, and you just can't beat that.

How many concepts do you think can be shoved into a neural network? It depends on how sparse they are. There's probably an upper bound from the number of parameters, because you still have to have the weights that connect them together, so that's one upper bound. There are in fact all these lovely results from compressed sensing, and the Johnson-Lindenstrauss lemma and things like this, that basically tell you: if you have a vector space and you want almost-orthogonal vectors, which is probably the thing you want here (you're going to say, I'll give up on having my features be strictly orthogonal, but I'd like them not to interfere that much, so I'll ask them to be almost orthogonal), then once you set a threshold for how much cosine similarity you're willing to accept, the number of such vectors is actually exponential in the number of neurons you have. So at some point that's not even going to be the limiting factor. There are some beautiful results there. And in fact it's probably even better than that in some sense, because those results are for the case where any random set of features could be active, but in fact the features have a correlational structure, where some features are more likely to co-occur and others are less likely to co-occur. So my guess would be that neural networks can do very well in terms of packing things in, to the point that this is probably not the limiting factor.
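A small numerical illustration of the almost-orthogonal-vectors point: in d dimensions you can pack far more than d random directions while keeping pairwise cosine similarities (the interference between features) small. The sizes here are arbitrary, and pairs are sampled rather than checked exhaustively, just to keep the sketch cheap.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512                                      # number of dimensions ("neurons")
for n_features in (512, 5_000, 50_000):      # pack in more and more random directions
    V = rng.normal(size=(n_features, d)).astype("float32")
    V /= np.linalg.norm(V, axis=1, keepdims=True)          # unit vectors

    # Sample random pairs instead of forming the full n x n similarity matrix.
    i = rng.integers(0, n_features, size=20_000)
    j = rng.integers(0, n_features, size=20_000)
    keep = i != j
    sims = np.abs(np.sum(V[i[keep]] * V[j[keep]], axis=1))  # |cosine similarity|

    print(f"{n_features:>6} features in {d} dims: "
          f"median |cos| = {np.median(sims):.3f}, max sampled |cos| = {sims.max():.3f}")
```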

How does the problem of polysemanticity enter the picture here? Polysemanticity is this phenomenon we observe where we look at many neurons and a neuron doesn't just represent one concept; it's not a clean feature, it responds to a bunch of unrelated things. And superposition you can think of as a hypothesis that explains the observation of polysemanticity: polysemanticity is the observed phenomenon, and superposition is a hypothesis that would explain it, along with some others. So that makes mech interp more difficult, right? Right. If you're trying to understand things in terms of individual neurons and you have polysemantic neurons, you're in an awful lot of trouble. The simplest problem is: you're looking at a neuron, trying to understand it, and it responds to a lot of things and doesn't have a nice meaning. That's bad. Another problem is that ultimately we want to understand the weights, and if you have two polysemantic neurons, and each one responds to three things, and you have a weight between them, what does that mean? Does it mean there are nine interactions going on? It's a very weird thing.

But there's also a deeper reason, which is related to the fact that neural networks operate on really high-dimensional spaces. I said our goal was to understand neural networks and understand their mechanisms, and one thing you might say is: it's just a mathematical function, why not just look at it? One of the earliest projects I did studied neural networks that mapped two-dimensional spaces to two-dimensional spaces, and you can interpret them in this beautiful way, as bending manifolds. Why can't we do that here? Well, as you go to a higher-dimensional space, the volume of that space is, in some sense, exponential in the number of inputs you have, so you can't just visualize it. We somehow need to break that exponential space apart into some non-exponential number of things that we can reason about independently. And the independence is crucial, because it's the independence that allows you to not have to think about all the exponential combinations of things. And things being monosemantic, things having only one meaning, is the key property that allows you to think about them independently. So if you want the deepest reason why we want interpretable, monosemantic features, I think that's really the deep reason.

And so the goal here, as your recent work has been aiming at, is: how do we extract the monosemantic features from a neural net that has polysemantic features and all this mess? Yes. We observe these polysemantic neurons, and we hypothesize that what's going on is superposition. And if superposition is what's going on, there's actually a well-established technique that is the principled thing to do, which is dictionary learning. It turns out that if you do dictionary learning, in particular if you do it in a nice, efficient way that in some sense nicely regularizes it as well, called a sparse autoencoder, these beautiful interpretable features start to just fall out where there weren't any beforehand. That's not a thing you would necessarily predict, but it turns out it works very, very well. To me, that seems like some non-trivial validation of linear representations and superposition. So with dictionary learning, you're not looking for particular kinds of categories; you don't know what they are. And this gets back to our earlier point: when we're not making assumptions, gradient descent is smarter than us. So we're not making assumptions about what's there. One certainly could do that; one could assume there's a PHP feature and go and search for it, but we're not doing that. We're saying we don't know what's going to be there; instead we're just going to let the sparse autoencoder discover the things that are there.
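A minimal sketch of a sparse autoencoder of the kind described here: learn an overcomplete set of directions (the dictionary of "features") such that activations are reconstructed as sparse, non-negative combinations of them. The dimensions, the L1 coefficient, and the toy training loop are illustrative only; in practice the input would be activations collected from a real model.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)   # activations -> feature coefficients
        self.decoder = nn.Linear(d_features, d_model)   # decoder weights ~ the learned dictionary

    def forward(self, x):
        f = torch.relu(self.encoder(x))                 # sparse, non-negative feature activations
        return self.decoder(f), f

d_model, d_features, l1_coeff = 512, 8 * 512, 3e-4
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Stand-in for a batch of residual-stream / MLP activations from a model.
acts = torch.randn(4096, d_model)

for step in range(200):
    x_hat, f = sae(acts)
    # Reconstruction loss plus an L1 sparsity penalty on the feature activations.
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("fraction of feature activations that are exactly zero:",
      float((f == 0).float().mean()))
```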

So can you talk about the Towards Monosemanticity paper from October last year? It had a lot of nice breakthrough results. That's very kind of you to describe it that way. Yeah, this was our first real success using sparse autoencoders. We took a one-layer model, and it turns out that if you do dictionary learning on it, you find all these really nice interpretable features. The Arabic feature, the Hebrew feature, the Base64 feature: those were some examples that we studied in a lot of depth and really showed were what we thought they were. It turns out that if you train the model twice, two different models, and do dictionary learning, you find analogous features in both of them, so that's fun. You find all kinds of different features. So that was really just showing that this works, and I should mention there was a Cunningham et al. paper that had very similar results around the same time.

There's something fun about doing these kinds of small-scale experiments and finding that it's actually working. Yeah, and there's so much structure here. Maybe stepping back for a moment: I thought that maybe, with all this mechanistic interpretability work, the end result was going to be that I would have an explanation for why it was very hard and not going to be tractable. We'd say, well, there's this problem of superposition, and superposition turns out to be really hard, and we're kind of screwed. But that's not what happened. In fact, a very natural technique just works. And that's actually a very good situation. This is a hard research problem, it's got a lot of research risk, and it might still very well fail, but I think a very significant amount of research risk was put behind us when that started to work.

Can you describe what kinds of features can be extracted in this way? Well, it depends on the model you're studying: the larger the model, the more sophisticated they're going to be, and we'll probably talk about the follow-up work in a minute. But in these one-layer models, some very common things were languages, both programming languages and natural languages. There were a lot of features that were specific words in specific contexts, so "the", and I think really the way to think about this is that "the" is likely about to be followed by a noun. You could think of it as a "the" feature, but you could also think of it as predicting a specific noun. There would be these features that fire for "the" in the context of, say, a legal document, or a mathematical document, or something like that. So maybe in the context of math, you see "the" and then predict vector, matrix, all these mathematical words, whereas in other contexts you would predict other things. That was common.

And basically you need clever humans to assign labels to what you're seeing? Yes. The only thing this is doing is unfolding things for you. If everything was folded over on top of itself, superposition folds everything on top of itself so you can't really see it, this is unfolding it. But now you still have a very complex thing to try to understand, so then you have to do a bunch of work understanding what these features are. And some of them are really subtle. There are some really cool things, even in this one-layer model, about Unicode: of course some languages are written in Unicode, and the tokenizer won't necessarily have a dedicated token for every Unicode character. So instead you'll have these patterns of alternating tokens that each represent half of a Unicode character, and then you have one feature that activates on one set and a different feature that activates on the opposing ones, saying: okay, I just finished a character, go and predict the next prefix; and then, okay, I'm on the prefix, predict a reasonable suffix; and you have to alternate back and forth. So these one-layer models are really interesting. And there's another thing: you might think there would just be one Base64 feature, but it turns out there's actually a bunch of Base64 features, because you can have English text encoded as Base64, and that has a very different distribution of Base64 tokens than regular Base64. And there are some things about tokenization as well that it can exploit. There's all kinds of fun stuff.

fun stuff how difficult is the task of

sort of assigning labels to what's going

on can this be automated by AI well I

think it depends on the feature and it

also depends on how much you trust your

AI so um there's a lot of work doing um

automated inability I think that's a

really exciting Direction and we do a

fair amount of automated inter and have

have Claude go and label our features is
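A minimal sketch of the "have a model label a feature" idea: show a language model the top-activating text snippets for a feature and ask for a short label. The snippets and model name below are placeholders, and the sketch assumes the Anthropic Python SDK's messages interface; it is not the actual pipeline described in the conversation.

```python
import anthropic

top_activating_snippets = [   # hypothetical dataset examples where the feature fires
    "ssl_verify=False  # skip certificate checks",
    "curl --insecure https://example.com",
    "requests.get(url, verify=False)",
]

prompt = (
    "Each snippet below strongly activates the same hidden feature in a language model.\n"
    "In a few words, what concept does this feature most likely represent?\n\n"
    + "\n".join(f"- {s}" for s in top_activating_snippets)
)

client = anthropic.Anthropic()                # reads ANTHROPIC_API_KEY from the environment
reply = client.messages.create(
    model="claude-3-5-sonnet-latest",         # placeholder model name
    max_tokens=50,
    messages=[{"role": "user", "content": prompt}],
)
print(reply.content[0].text)                  # e.g. something like "disabling TLS/SSL verification"
```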

Are there some fun moments where it's totally right, or totally wrong? Well, I think it's very common that it says something very general, which is true in some sense but isn't really picking up on the specifics of what's going on. I think that's a pretty common situation. I don't know that I have a particularly amusing example. That's interesting, that little gap between something being true and it not quite getting to the deep nuance of a thing. That's a general challenge: these models can say a true thing, but it's missing the depth sometimes. And in this context it's like the ARC challenge, the sort of IQ-type tests: it feels like figuring out what a feature represents is a little puzzle you have to solve. Yeah, and I think sometimes they're easier and sometimes they're harder. Yeah, I think that's tricky.

Now, there's another thing. Maybe in some ways this is just my aesthetic coming in, but I'll try to give you a rationalization. I'm actually a little suspicious of automated interpretability. Part of it is just that I want humans to understand neural networks, and if the neural network is understanding it for me, I don't quite like that. In some ways I'm like the mathematicians who say that if there's a computer-automated proof, it doesn't count: they won't understand it. But I do also think there's a kind of "Reflections on Trusting Trust" type issue here. There's this famous talk about how, when you're writing a computer program, you have to trust your compiler, and if there was malware in your compiler, it could inject malware into the next compiler, and you'd be in trouble. Well, if you're using neural networks to verify that your neural networks are safe, the hypothesis you're testing is: okay, maybe the neural network isn't safe, and you have to worry about whether there's some way it could be screwing with you. I think that's not a big concern now, but I do wonder, in the long run, if we have to use really powerful AI systems to audit our AI systems, is that actually something we can trust? But maybe I'm just rationalizing, because I just want us to get to a point where humans understand everything.

Yeah, that's hilarious, especially as we talk about AI safety and about looking for features that would be relevant to AI safety, like deception and so on. So let's talk about the Scaling Monosemanticity paper from May 2024. What did it take to scale this, to apply it to Claude 3 Sonnet? Well, a lot of GPUs. A lot more GPUs. But one of my teammates, Tom Henighan, was involved in the original scaling laws work, and something he was interested in from very early on is: are there scaling laws for interpretability? So something he did immediately, when this work started to succeed and we started to have sparse autoencoders working, was to get very interested in the scaling laws for making sparse autoencoders larger, and how that relates to making the base model larger. It turns out this works really well, and you can use it to project: if you train a sparse autoencoder of a given size, how many tokens should you train it on, and so on.

tokens should you train on and so on so

this was actually a very big help to us

in scaling up um this work um and made

it a lot easier for us to go and train

um you know really large sparse Auto

encoders where you know um it's not like

training the big models but it's it's

starting to get to a point where it's

actually actually expensive to go um and

train the really big ones so you have to

I mean you have to do all the stuff of

like splitting it across large I mean

there's a huge engineering challenge

here too right so yes so so there's

there's a there's a scientific question

of how do you scale things effectively

um and then there's an enormous amount

of engineering to go and scale this up

you have to you have to chart it you

have to you have to think very carefully

about a lot of things I'm lucky to work

with a bunch of great Engineers cuz I am

definitely not a great engine yeah on

the infrastructure especially yeah for

sure so it turns out tldr it worked it

worked yeah and and I think this is

important because you could have

imagined you could like you could have

imagined a world where you set after

towards monos fanticy you know Chris

this is great you know it works on a one

layer model but one layer models are

really idiosyncratic um like you know

maybe maybe there just something ID like

maybe the linear representation

hypothesis and super hypothesis is the

right way to understand a one layer

model but it's not the right way to

understand large models um and so I

think um I mean first of all like The

Cutting him at all paper sort of um cut

through that a little bit and and sort

of suggested that this wasn't the case

but um scaling onity sort of I think was

significant evidence that even for very

large models and we did it on Claude 3

sauna which at that point was uh one of

our production models um you know even

these models um seem to be very you know

seem to be substantially explained at

least by linear features and you know

doing dictionary learning on them works

and as you learn more features you go

and you explain explain more and more so

that's a I think a quite a promising

sign and you find now really fascinating

abstract features um and the features

are also multimodal they respond to

images and text for the same concept

which is fun yeah this can you explain

that I mean like you know back door

there's just a lot of examples that you

can yeah so maybe maybe let's start with

a one example to start which is we found

some features around sort of security

vulnerabilities and back doors and codes

so it turns out those are actually two

different features um so there's a

security vulnerability feature and if

you force it active Claude will start to

go and write um security vulnerabilities

like buffer overflows into code and it

also it fires for all kinds of things

like you know some of some of the top

data set examples for it were things

like you know dash dash disable um you

know SSL or something like this which

are sort of obviously really um uh

really insecure so at this point it's

kind of like maybe it's just because the

examples are presented that way it's

kind of like surface a little bit more

obvious examples right um I guess the

the idea is that down the line might be

able to detect more Nuance like

deception or bugs or that kind of stuff

Yeah, well, maybe I want to distinguish two things. One is the complexity of the feature or the concept, and the other is how subtle the examples we're looking at are. When we show the top dataset examples, those are the most extreme examples that cause that feature to activate, so it doesn't mean it doesn't fire for more subtle things. The insecure-code feature fires most strongly for these really obvious disable-the-security type things, but it also fires for buffer overflows and more subtle security vulnerabilities in code. These features are all multimodal, so you can ask what images activate a feature, and it turns out the security-vulnerability feature activates for images of people clicking through Chrome warnings, past the page that says this website's SSL certificate might be wrong, or something like this. Another thing that's very entertaining is the backdoors-in-code feature. You activate it, and Claude writes a backdoor that will go and dump your data to some port or something. But you can ask: okay, what images activate the backdoor feature? It was devices with hidden cameras in them. There's apparently a whole genre of people selling devices that look innocuous and have hidden cameras, with ads touting that there's a hidden camera in them, and I guess that is the physical version of a backdoor. So it sort of shows you how abstract these concepts are. I'm sort of sad that there's a whole market of people selling devices like that, but I was kind of delighted that that was what came up as the top image examples for the feature.

Yeah, it's nice that it's multimodal and works across contexts; it's a broad, strong representation of a singular concept. To me, one of the really interesting features, especially for AI safety, is deception and lying, and the possibility that these kinds of methods could detect lying in a model, especially as it gets smarter and smarter. Presumably that's a big threat from a superintelligent model, that it can deceive the people operating it as to its intentions, or anything like that. So what have you learned from detecting lying inside models? Yeah, I think we're in some ways in early days for that. We find quite a few features related to deception and lying. There's one feature that fires for people lying and being deceptive, and if you force it active, Claude starts lying to you. So we have a deception feature. There are all kinds of other features about withholding information and not answering questions, features about power seeking and coups and things like that. There are a lot of features related to spooky things, and if you force them active, Claude will behave in ways that are not the kind of behaviors you want.

kind of behaviors you want what are

possible next exciting directions to you

in the space of uh Mech and well there's

a lot of things

um so for one thing I would really like

to get to a point where we have circuits

where we can really understand um not

just the features uh but then use that

to understand the computation of models

um that really for me is is the the

ultimate goal of this um and there's

been some work we we put out a few

things there's a paper from Sam Marks

that does some stuff like this there's

been some I'd say some work around the

edges here um but I think there's a lot

more to do and I think that will be a

very exciting thing um that's related to

a challenge we call interference weights

um where um due to supersition if you

just sort of navely look at whether

featur are connected together there may

be some weights that sort of don't exist

in the upstairs model but are just sort

of artifacts of of superposition so

that's a a sort of technical challenge

related to that

I think another exciting direction is this: you might think of sparse autoencoders as being kind of like a telescope. They allow us to look out and see all these features that are out there, and as we build better and better sparse autoencoders, and get better and better at dictionary learning, we see more and more stars, and we zoom in on smaller and smaller stars. But there's a lot of evidence that we're still only seeing a very small fraction of the stars. There's a lot of matter in our neural network universe that we can't observe yet. It may be that we'll never have fine enough instruments to observe it, and maybe some of it just isn't possible, isn't computationally tractable, to observe. So there's a kind of dark matter, maybe in the sense of earlier astronomy, when we didn't know what this unexplained matter was. I think a lot about that dark matter, and whether we will ever observe it, and what it means for safety if we can't, if some significant fraction of neural networks is not accessible to us.

Another question I think a lot about is this: at the end of the day, mechanistic interpretability is a very microscopic approach to interpretability. It's trying to understand things in a very fine-grained way, but a lot of the questions we care about are very macroscopic. We care about questions about neural network behavior, and I think that's the thing I care most about, but there are lots of other larger-scale questions you might care about. The nice thing about having a very microscopic approach is that it's maybe easier to ask whether something is true, but the downside is that it's much further from the things we care about. So we now have this ladder to climb, and there's a question of whether we'll be able to find larger-scale abstractions that we can use to understand neural networks. Can we get up from this very microscopic approach?

Yeah, you've written about this kind of "organs" question: "If we think of interpretability as a kind of anatomy of neural networks, most of the circuits threads involve studying tiny little veins, looking at the small scale and individual neurons and how they connect. However, there are many natural questions that the small-scale approach doesn't address. In contrast, the most prominent abstractions in biological anatomy involve larger-scale structures, like individual organs, like the heart, or entire organ systems, like the respiratory system. And so we wonder: is there a respiratory system, or heart, or brain region of an artificial neural network?" Yeah, exactly.

If you think about science, a lot of scientific fields investigate things at many levels of abstraction. In biology you have molecular biology, studying proteins and molecules and so on, and you have cellular biology, and then you have histology, studying tissues, and you have anatomy, and then zoology, and then ecology. So you have many, many levels of abstraction. Or physics: maybe the physics of individual particles, and then statistical physics gives you thermodynamics and things like that. You often have different levels of abstraction. And I think right now, mechanistic interpretability, if it succeeds, is sort of like a microbiology of neural networks, but we want something more like anatomy. A question you might ask is: why can't you just go there directly? And I think the answer is superposition, at least in significant part. It's actually very hard to see this macroscopic structure without first breaking down the microscopic structure in the right way and studying how it connects together. But I'm hopeful that there is going to be something much larger than features and circuits, and that we're going to be able to have a story that involves much bigger things; then you can study in detail the parts you care about. As opposed to neurobiology, something like a psychologist or a psychiatrist for the neural network. And I think the beautiful thing would be if, rather than having disparate fields for those two things, you could build a bridge between them, such that all of your higher-level abstractions are grounded very firmly in this very solid, more rigorous, ideally, foundation.

What do you think is the difference between the human brain, the biological neural network, and the artificial neural network? Well, the neuroscientists have a much harder job than us. Sometimes I just count my blessings by how much easier my job is than the neuroscientist's. We can record from all the neurons, and we can do that on arbitrary amounts of data. The neurons don't change while you're doing it, by the way. You can ablate neurons, you can edit the connections and so on, and then you can undo those changes. That's pretty great. You can intervene on any neuron and force it active and see what happens. You know which neurons are connected to what: neuroscientists want to get the connectome; we have the connectome, and we have it for something much bigger than C. elegans. And not only do we have the connectome, we know which neurons excite or inhibit each other; it's not just that we know the binary mask, we know the weights. We can take gradients; we know computationally what each neuron does. The list goes on and on: we just have so many advantages over neuroscientists. And then, despite having all those advantages, it's really hard. So one thing I sometimes think is: gosh, if it's this hard for us, it seems impossible, or near impossible, under the constraints of neuroscience. I don't know, I've got a few neuroscientists on my team, and maybe part of me thinks some of them would like to have an easier problem that's still very hard. They could come and work on neural networks, and then, after we figure things out in the easier little pond of trying to understand artificial neural networks, which is still very hard, we could go back to biological neuroscience.

I love what you've written about the goal of mech interp research as two goals: safety and beauty. Can you talk about the beauty side of things? Yeah, there's this funny thing where I think some people are kind of disappointed by neural networks. They say: neural networks are just these simple rules, and then you just do a bunch of engineering to scale them up and they work really well; where are the complex ideas? This isn't a very nice, beautiful scientific result. And when people say that, I sometimes picture them saying: evolution is so boring, it's just a bunch of simple rules, and you run evolution for a long time and you get biology; what a sucky way for biology to have turned out, where are the complex rules? But the beauty is that the simplicity generates complexity. Biology has these simple rules, and it gives rise to all the life and ecosystems we see around us, all the beauty of nature; that all just comes from evolution, from something very simple. And similarly, I think neural networks create enormous complexity and beauty and structure inside themselves that people generally don't look at and don't try to understand, because it's hard to understand. But I think there is an incredibly rich structure to be discovered inside neural networks, a lot of very deep beauty, if we're just willing to take the time to go and see it and understand it.

Yeah, I love mech interp: the feeling that we are understanding, or getting glimpses of understanding, the magic going on inside is really wonderful. It feels to me like one of those questions that is just calling out to be asked. A lot of people think about this, but I'm often surprised more aren't: how is it that we don't know how to directly create computer programs that can do these things, and yet we have these amazing neural networks, these artifacts, that can do all these things we don't know how to do? If you have any degree of curiosity, that feels like obviously the question calling out to be answered. Yeah, I love the image of the circuits growing towards the light of the objective function. Yeah, it's this organic thing that we've grown, and we have no idea what we've grown.

Well, thank you for working on safety, thank you for appreciating the beauty of the things you discover, and thank you for talking today, Chris. This was wonderful. Thank you for taking the time to chat as well.

Thanks for listening to this conversation with Chris Olah, and before that with Dario Amodei and Amanda Askell. To support this podcast, please check out our sponsors in the description. And now, let me leave you with some words from Alan Watts: "The only way to make sense out of change is to plunge into it, move with it, and join the dance." Thank you for listening, and hope to see you next time.
