
Can You Teach Claude to be ‘Good’? | Meet Anthropic Philosopher Amanda Askell

By Hard Fork

Summary

Topics Covered

  • Ads Inevitably Corrupt AI Interfaces
  • Targeted Ads Erode User Trust
  • Premium Tiers Widen AI Access Gap
  • Constitutions Cultivate AI Judgment
  • AI Consciousness Remains Open Question

Full Transcript

I'm Kevin Roose, a tech columnist at the New York Times.

>> I'm Casey Newton from Platformer, and this is Hard Fork. This week, ads have arrived in ChatGPT. How will they change OpenAI? Then there's a new constitution for Claude. Anthropic philosopher Amanda Askell is here to talk about how to shape an AI's personality.

I'm going to use some of these techniques on you.

Okay. So today we're talking about ads, specifically ads in ChatGPT, because late last week OpenAI announced that they are going to start testing ads in ChatGPT for logged-in adults in the US, on the free and the low-cost Go tiers of ChatGPT.

>> That's right, Kevin. And we'll discuss it right after these ads.

>> No, we already did the ads.

>> Okay, so at least on my feed, people were reacting to this pretty negatively. I think a lot of people have gotten accustomed to using ChatGPT and other chatbots without a lot of direct commercial pressure. It's a refreshing break from all of the ads that have been shoveled at us on other platforms for years. And so collectively, I think people were just sort of resigned, like, ah, you know, we knew the honeymoon would be over eventually and that we'd be forced to see ads in ChatGPT like we are everywhere else.

>> Yeah. I think people can just remember products that they used that once did not have ads and now do. And no one thinks of the moment that ads arrived as the moment when the product got really good.

>> Yeah. Right. I think there are some exceptions. Some people like Instagram ads, for example, but I think mostly people see this as sort of a blight on the internet. Maybe a necessary blight, but a blight nonetheless. And I think people were also surprised that OpenAI was moving in this direction, because of some things that Sam Altman has said in the past about how he doesn't like ads and how he wanted to treat them as basically a last resort for OpenAI. Some people were saying, "Oh, this means that they're in trouble. They need to raise a bunch of money, you know, so they can keep building out their data centers and things like that." So Casey, what did you make of OpenAI's announcement about ads?

>> Well, Kevin, on one hand, I think this is inevitable. There's an analyst I follow, Eric Seufert, who often says that everything is an ad network, and if you have hundreds of millions of people coming and paying attention to a service every single week, inevitably there's going to be overwhelming pressure to put ads on it. Also, we know that OpenAI needs revenue, right? This is the company that has laid out the most ambitious infrastructure investment project in human history. They have nowhere close to the money needed to build it, and we just know that they would not have been able to fulfill their dreams on subscription revenue alone. That said, as you point out, Sam Altman himself said that ads were going to be a last resort. A great Papa Roach song. And so in this moment, we are now at the last resort. And I think it's just interesting that after everything else they tried, eventually they just said, "Look, to do what we need to do, we've got to break glass. The emergency is here."

>> Yeah. They said, "Cut my life into pieces because this is my last resort."

>> Yeah.

>> And the question is, will this cut their life into pieces?

>> Yes. So, we're going to get there, but first our disclosures. The New York Times is suing OpenAI, Microsoft, and Perplexity over alleged copyright violations related to the training of large language models. And my boyfriend works at Anthropic. Let's just start with the actual announcement that they made, because they not only said that they were going to start testing ads, they also gave some previews of what these ads are going to look like. If you look at their mockup version of the ads, it's a kind of bolt-on to the ChatGPT answer. They've been very clear that this is not going to influence the answer that ChatGPT gives, or so they claim. Instead, it's a little banner at the bottom of the answer. In the mockup, someone is asking ChatGPT for ideas for a dinner party, ChatGPT gives a response, and then at the bottom there's a little sponsored banner for Harvest Groceries, including a link where you can go and buy some hot sauce.

>> And if I could just pause there, I have to say, Kevin, I'm already feeling lied to, for this reason. They have said to us, your query is not going to affect the advertisement that we're showing you. And yet here you have someone saying, "I want some ideas for cooking Mexican food for my dinner party," and ChatGPT says, "Well, here's some groceries, including hot sauce." It sure feels like something was being influenced there, right? Like the message is being tied to the query.

>> Well, no. So their response to this would be that there are two parts of this response. There's the actual response from the model, and then there's the ad. And what they're saying is not that they won't show you ads that are relevant to the thing that you're asking ChatGPT about. It's that there's this sort of sacrosanct part, the actual reply from the model, that they are not going to let advertisers pay their way into. That is what they're claiming, anyway.

>> All right. All right.

>> So that example is a much more straightforward ad, the kind we've seen on Google and Facebook and other platforms for many years. The second kind of ad OpenAI mocked up for this announcement was, I think, more interesting, because it shows a new way of interacting with ads. Basically, a user is planning a trip to Santa Fe. ChatGPT pops up this little sponsored widget from a desert cottages place, I guess a hotel or resort, and it presents you with an option where you can go and chat with the advertiser and ask more questions before deciding whether or not to make a purchase. Such a relatable situation. I think we've all had the experience of watching ads on TV and saying, "Why can't I have a conversation with this? I want to share my thoughts with McDonald's right now, but I can't." But now you can.

>> Yes. So, let's talk about the ad principles that OpenAI laid out as part of this announcement, because I think it gives a sense of the objections that they're trying to get ahead of.

>> There are five principles. They say: mission alignment, answer independence, conversation privacy, choice and control, and long-term value. Basically, I think they are sensitive to the criticism that putting ads into ChatGPT means that they are now going to start directing people to more commercial use cases, optimizing for engagement, trying to make people spend more time in the app. I think these are very reasonable fears that people, including me, have. But this is sort of their attempt to say, "Well, we're introducing this, but don't worry, your experience of ChatGPT is not going to change."

change." >> Yeah. I I was I was talking about this

>> Yeah. I I was I was talking about this story with my friend Alex over the weekend and he said, "You know, I'm so excited about ads and chatt I'm going to tell it my lower back hurts and it will ask me if I've tried mosquite barbecue

sauce." AND LIKE THAT IS THE FEAR, you

sauce." AND LIKE THAT IS THE FEAR, you know.

>> No, I mean, yes, there will be some initial stumbles like that, but I think the longer-term worry here is that ad platforms, as they mature and get better and get more data, tend to try to confuse their users, right? There's this amazing graphic that I think about a lot. Search Engine Land, the blog that covers Google and other search engines, made this timeline of how Google's ad labels have changed over the years. And it's pretty amazing, because at first, when they introduced ads into Google Search, they were very noticeable. They had a different color background. They really stood out on the page. And then you see, over time, with each successive update, they get a little closer to the organic search results. Eventually they do away with the colored backgrounds. They have this little yellow ad icon, and then that icon gets smaller and less noticeable, and then it sort of just blends in with the organic content. And I think that's the fear here: while ChatGPT may start out with these very clearly labeled ad modules, over time, as the commercial pressures get more intense, they are just going to have a lot of incentives to blend that advertising content in with the organic responses and make it less noticeable.

>> Yes. And we've already seen this exact trajectory play out at OpenAI. It went from no ads, to ads will be a last resort, to ads are now in ChatGPT. So if you think that the bargain is not going to change further, I have news for you.

>> Totally. And of course, the narrative from OpenAI that we're now hearing is, well, ads are the only way to make a free or low-cost product accessible to billions of people. Do you have thoughts on that narrative? Because that's also something that we heard from Facebook back in the day. People would constantly ask them, oh, why don't you just charge people to join Facebook instead of showing them all these ads? And they would consistently say, oh, well, that's not scalable. People in poorer countries can't afford to pay a subscription fee. And so, basically, ads are the only way to reach global scale.

>> I think on some level I do agree with this. I think that ads and subscriptions are the two core pillars of any media business, and OpenAI is a kind of media business, right? I should also say I don't hate the examples that they use. You know, I'm asking ChatGPT about making dinner and it shows me ads for groceries. I don't think that's horribly corrosive to the user experience. Nor is: I want to take a trip, and it says, "Well, here's a place where you might stay." I think if I were a student, or I were between jobs, and this meant that I could get access to better AI tools, or maybe a higher rate limit than I otherwise could get, I would probably take that trade, right? Twenty dollars a month is a lot for most people, you know, not to mention $200 a month for an even higher tier. So I think that there is a reason to pursue this. And I think there are ways that it could not be too bad. It has just been my observation that the exact dynamic you just described always plays out, which is that it starts out not all that bad and then it just progressively gets worse.

>> Right. Yeah. I think we've made peace with ads in a lot of different contexts. I don't think most people notice or pay attention to them, when they can tell that they're ads at all. What I'm watching for, what I'm skeptical of, is whether the actual product and research decisions start bending toward engagement maximization. There's this quality that a lot of these big ad platforms, social networks, search engines, etc. have, where eventually, once the ad revenue starts really flowing, the tail kind of starts wagging the dog, and you start making product decisions about how you want to show information to people with the advertising revenue predominant in your mind. So I think the question is not whether these first couple of ads that we're seeing from OpenAI are going to be good or not. It's whether, two or three years from now, ChatGPT is being steered toward ad-friendly topics. And I genuinely just don't know the answer there.

>> I don't know either, Kevin, but if I had to guess, I would predict that this moment winds up being a pretty significant milestone in the development of ChatGPT, in that I think when you introduce advertising, in particular personalized, targeted advertising, it just fundamentally changes the relationship between the product and the user. Think about what personalized targeted ads did over time to trust in Facebook and Instagram. Think about all the conspiracy theories out there that, oh, your phone is listening to you. Not true, by the way. I realize most people still believe that's true. It's not. But trust in those products is lower because of the incredibly intelligent, invasive-feeling personalization that they were able to do inside these products. My prediction is that the AI version of this turns out to be even worse. Right? Think about everything that ChatGPT is going to know about you. I think OpenAI is going to bump into that creepy line really quickly, where it's showing you stuff, and maybe it's not even using all that much personalized information, but the user is going to feel that they have shared so much of their life with OpenAI that those ads they start getting just feel worse and worse. So the dynamic I am watching is how this changes the relationship of the user base to OpenAI, because I do think that ads can be really corrosive to that.

>> Yeah. And at the same time, the ad models that you mentioned have also made those companies billions of dollars and made them into some of the biggest companies in the world. So I think if you're OpenAI, you're just staring at this potential huge bucket of money, and it's very hard to pass that up, especially when you have such intense capital needs over the next few years. I should also say I think this was inevitable given some of the personnel decisions that OpenAI has made. Fidji Simo, who is the CEO of applications over there now, was brought in from Instacart; before that she was at Meta for many, many years, and one of her signal accomplishments there was introducing ads into the mobile news feed, which made them billions of dollars. So that is the kind of person you hire if you are interested in developing a multi-billion-dollar ad platform on your product.

>> Yeah. Well, one question I have for you about that is, how does this change the competitive landscape generally? You have Demis Hassabis saying this week, in response to the news that ads are coming to ChatGPT, well, we don't have any plans to do that in Gemini. And he sort of took a shot at them. He said maybe they feel like they need to make more revenue. Left unsaid: the fact that he works for a giant search monopoly that is able to funnel all of Google's advertising profits into the product, an observation you made on X, by the way, and it was a great one. And so for the moment, at least, free users of Gemini will be able to enjoy the subsidy that mother Google is giving them, and you're not going to have any of these corrosive effects in that product.

You also have Anthropic, which has said, basically, we truly have no plans to do ads in Claude, ever. We are primarily going to be selling to businesses, and so this is just not our concern. And for the moment, I don't have any illusions that Claude is going to grow to compete with ChatGPT. But over time, if the experience does get worse in an ad-supported chatbot, I could see lots of people wanting an alternative. I think in this sense OpenAI and Google are much more directly competing on ads than OpenAI and Anthropic. Anthropic has sort of said, you guys can fight over consumer; we're going to focus on the enterprise. I think it's a really hard fight for OpenAI to pick. Google has, as you said, this enormous established search ad business. They have advertisers all over the world who are already spending money on Google, whose details and payment information and workflows already include Google and its products. And so I think OpenAI coming in and trying to build a Google-style ad platform is just a harder uphill battle than it might have been a couple of years ago.

>> Yeah. And also we should say that even though ads aren't going to be in Gemini, they are in the AI Overviews in Google Search, right? So in that sense Google even has a head start against OpenAI.

>> Totally. So Casey, what do you think is motivating this decision now by OpenAI?

Like does it tell us anything about the state of their business or maybe some wobbliness in their financials that they are going out and doing this now?

>> Well, one thing is that it is a reaction to how much ChatGPT grew in the last year. They have hundreds of millions of users. They now have to support many of those users. The majority of them are on the free tier, right? Which means that OpenAI is losing money on every single one of them. And so I think it has increasingly become a priority for the company to figure out, hey, how can we monetize these people in some way so we aren't losing quite as much money? They've also been designing more and more products that have obvious advertising-shaped holes. They released Pulse last year, this sort of daily summary that comes up for paid users. That seems like a natural place to throw in a bunch of ads. They launched Sora last year, the infinite video slop feed. They explicitly said at the time, we are going to use this to generate revenue to fund our long-term ambitions. So they're building homes for ads. They need the ad revenue. And now all of that is starting to come together.

>> Yeah, I think you're right. And I think that all these companies are realizing that they're going to need billions of dollars, some of them hundreds of billions of dollars, to fulfill their ambitions. And it's just not easy to do that when you're charging people 20 bucks a month for a subscription. You've got to sell a lot of subscriptions to do that. And so I think OpenAI reasonably is concluding that the subscription model alone just isn't going to cut it for them. That's not unique to them. Netflix has also started adopting ads for its lower-cost plans. Disney Plus.

>> Yeah, many other businesses have done this as well. I will just say, I enjoy paying for AI products. I mean, I am privileged in the sense that I can afford to, but I kind of like the idea that I am paying for something that is an undiluted, unsullied experience. I really hope that as these companies do start pushing more into ads, they maintain that ability to do what I do and pay your way into the top-level version of that experience.

>> Yeah. Well, you know, people once felt this way about Google Search, right? They felt like, this is an unsullied, undiluted picture of the web, and when I search for a website, I am going to get the best answer to my query. And then a bunch of search engine optimizers came in and were paid a lot of money to try to rejigger the search index so that their clients showed up at the top of the page. And then Google built one of the largest advertising businesses in the world and let all of those advertisers put their results on top of the good ones. So, you know, there have been people saying for over a year now that the versions of these chatbots that we're using might be the best they ever are in that core respect, that this is sort of the last moment of purity before commercial incentives come in and warp the whole thing. And that is my big concern about what we're starting to see here.

>> Well, and that's not just a concern about advertising. Another thing that we've seen over the past year or two is that all these businesses are starting to hire these AI optimization firms, who say, "Oh, we can make your restaurant or your hotel or your craft shop appear higher in ChatGPT search results." That is something that is not flowing through OpenAI's ad platform, and probably won't. But in the same way that Google ads and Google SEO were sort of different economies that both had the effect of degrading the quality of search results, I think OpenAI has to tangle with both of those things.

>> Yeah. All right. So a year from now, Kevin, what do you think we will have seen in the development of ads, both in ChatGPT and across the landscape? And do you think it is going to mark the beginning of a fundamental change in the way that people use chatbots?

>> I think we're going to have kind of a haves and have-nots situation, where if you are someone who can afford to pay for the premium versions of these chatbots, your experience will be pretty much what it is today. You will get access to the latest models. You will not have a bunch of ads cluttering up your results from the models, and you will not feel the commercialization of AI in this specific way. I think that if you are a free user of these platforms, and you cannot afford, or don't want, to pay for the premium versions, that experience is going to be much worse a year or two from now. I am a YouTube Premium subscriber and have been for a long time.

>> Okay, flex.

>> And whenever I talk to a friend who doesn't pay for YouTube, or whenever I see YouTube running on their computer, it's always horrifying. I understand that this is the majority experience, but they've shoved so many ads into every single video. Those ads are unskippable. They run for a long time. It's a terrible experience. And I think that's going to be sort of what we see in chatbots, too. What about you?

>> It's a grim prediction, but it is actually the one that I share. The haves and have-nots framing was the one that I was going to use. And when you said it, I thought, "Oh my god, I have actually mind-melded with this man. I spent too long in the studio and now his thoughts are my own. It's creeping me out." So I'm actually going to get out of here. I need to take a walk or something.

Casey, a couple of years ago, you came back from a dinner party that you had been to, and you told me, "I just sat next to the most fascinating person in the world."

>> I really felt that way, Kevin. I had been at a dinner where Amanda Askell was one of the guests. Amanda works at Anthropic and is sometimes called the Claude mother because of the role that she plays in shaping Claude's personality. Now let me say, since I first met Amanda, my boyfriend has gone to work for Anthropic, so I'm going to make an extra disclosure because this segment is about that company. But the basic feeling I had at that dinner remains true, which is that this is one of the most fascinating people in the world.

>> Yes, I agree. Amanda is also a somewhat unusual figure in the AI world. She is a philosopher by training. She has a PhD in philosophy. She went to work at OpenAI during its early days and then moved over to Anthropic a little bit later. And for the past several years, she has been the person at Anthropic who is most concerned with how this model is supposed to behave in the world.

>> Yeah. And I just love that story, Kevin, about Amanda's background, because we all know somebody who studied philosophy in college, and we all know how much flak they would get for choosing such a frivolous way of spending their life, just sort of navel-gazing for years on end, writing arcane documents that no one ever read. And Amanda is a person who studied philosophy and now has this incredibly high-stakes job where she is trying to shape the behavior of a model that is so consequential.

>> Yes. And Amanda has been on our short list of guests that we wanted to get on the show for a very long time. We were just looking for the right time and reason to get her on, and now we have one, because her team at Anthropic has just released a new constitution for Claude. This is a very long document that is given to Claude to tell it how it should behave, but also to give it a sense of its obligations. It is not really a list of rules. This is not the Ten Commandments for Claude. It's more like a document about how Claude should perceive and reflect upon its role in the world.

>> Now, does it have to be ratified by two-thirds of the states, Kevin, or is this already in effect?

>> I think this is already in effect.

>> Oh, okay. Interesting.

>> Yes. But there is a possibility that we could have a constitutional crisis.

>> I look forward to it. Aside from your disclosure about your boyfriend working at Anthropic, I think we should also just be upfront with people and say this is going to be a hard conversation for some of our listeners. If you are a person who still believes that these language models are merely doing next-token prediction, that there's nothing really going on under the hood, that they are just simulating thinking rather than doing actual thinking themselves, you may be approaching this and saying, these people sound crazy. What are they talking about?

>> Yeah. And it is okay if you feel that way, but I think it is still important to understand how people in high-ranking positions at these big labs think and talk about their own work, because it is having an effect on the products they release. I would also put it to you that there are just a huge number of people right now who are working on the proposition that you might be able to emulate a human brain, and that the better you get at that, the likelier it is that this emulator has something resembling thoughts and feelings, and maybe something resembling an identity. And so if that question disgusts you, you will probably not like this segment. But if you have just the slightest bit of curiosity about it, well, I hope you'll find it quite interesting.

>> Yeah. So, let's welcome in Amanda Askell.

>> Amanda Askell, welcome to Hard Fork.

>> Thanks for having me.

>> Hey, Amanda.

>> So, we've described you as a philosopher who is in charge of Claude's personality. Is that an accurate description of your job? What do you do?

>> Yeah, I guess I try to think about what Claude's character should be like, and articulate that to Claude, and, you know, try to train Claude to be more like that. So, yeah, it's a pretty accurate description, I think.

>> This is a really unusual role that you have. Can you tell us a little bit about how you came into this role? And do you find yourself surprised that your background in philosophy wound up leading you to such a high-stakes place?

>> Yeah, it's really interesting, because my path wasn't a straight one. You know, I have said before that if you do a PhD in ethics, I think there's a risk that you end up doing something else, because you're thinking a lot about goodness, the nature of ethics, the problems in the world, and then sometimes you're like, I am spending three years writing a document that's going to be read by 17 people. Is this the thing that I should be doing? It can definitely make you question that. And so when I went into AI, it wasn't necessarily even with, oh, philosophy is going to be really useful. I was just kind of like, there's probably a lot of space for people who are enthusiastic, who have skills, who are willing to learn, and this seems important. So I originally started out in policy, and then when Anthropic started, it was very small, and I joined mostly with a kind of, I'm just willing to help with various aspects of this, because I had been working a little bit in model evaluation and things like that. So, I don't know. Sometimes people think, oh, you started out as this philosopher, and I'm like, well, it was a startup. I was just doing anything that needed done.

>> Right. And then was there some moment where you're in the building of some early Claude model and someone stands up and yells, "Hey, is there a philosopher in the house?"

>> Yeah. I mean, you can make Slack groups, and I tried to make a philosophers one, you know, for philosophy emergencies. And that group virtually never gets called upon. There are a few of us now, and you can in fact declare a philosophical emergency; that just doesn't happen that much.

>> Well, we'll see if we can try to trigger one by the end of the conversation.

>> Yeah, exactly.

>> So let's start by going back to last month. This so-called soul doc starts circulating on the internet. People are playing around with Opus 4.5, the newest Claude model, and a couple of them claimed to have elicited this document that Claude was referring to as the soul doc. What was that thing that people were discovering and circulating?

>> Yeah. So that was a previous version of what is now the constitution, which we have released today, and internally we were calling it the soul doc, which I think is a term of endearment. It turned out okay. I just remember when I found out, because I was on a hike somewhere north of here, so I didn't have internet, and I just got a text being like, oh, I assume you saw that the soul doc leaked. And I just remember driving back to the city in a state of complete stress, because I didn't have any context on this. And then it turned out I think it was actually quite well received. But basically, we do train Claude to understand this document and to know its contents. At least if you initially talk with the model, it won't reveal this straight away. So I thought, okay, it seems like the model probably knows and uses this, but I didn't know it knew it so well that if people managed to find or trigger it, it would actually be very willing to talk about it.

>> That is a philosophy emergency, by the way. That's kind of what got activated.

>> Yeah. So the model was just very willing to talk about it, and actually could talk about it in a lot of detail. It wasn't all perfect, but it knew a lot. It knew the content quite well, and so people had managed to extract a huge amount of it.

>> So let's talk about the origins of this document, going back several years now. Anthropic had this concept of constitutional AI. I believe it first published its constitution in 2023. So what's changed between then and now? That constitution we might have first read in 2023, the soul doc, and now this new constitution that you're publishing today.

>> The constitution is basically trying to give Claude as much full context as possible. So instead of just having individual principles, it's basically: here is what Anthropic is. Here is what you are in terms of being an AI, who you're interacting with, how you're deployed in the world. Here's how we would like you to act and to be. And here are the reasons why we would like that. And then the hope is, if you get a completely unanticipated situation, if you understand the values behind your behavior, I think that's going to generalize better than a set of rules. So if you understand that the reason you're doing this is because you actually are trying to care about people's wellbeing, and you come to a new situation where there are hard conflicts between someone's wellbeing and their stated preferences, you're a little better equipped to navigate it than if you just know a set of rules that don't even necessarily apply in that case.

>> Yeah. I'll just say, I think this constitution is fascinating. I think it's one of the most interesting technical documents, but also one of the most interesting pieces of writing, I've read in a long time. It reads more like a letter to Claude about its own circumstances and what kinds of behaviors and challenges it might run up against in its life out there in the world. And I just thought that was a fascinating decision. And I'm curious: is that because the old approach had run into some limits or problems? Is it because the rule structure, do this, don't do this, is more fragile? It really seemed like you're trying to cultivate almost a sense of judgment in Claude, and I'm curious what prompted that.

>> Yeah, I think we are seeing limits with approaches that are very rule-based. Or maybe my worry is that your rules can actually generalize, even if they seem good, especially if you don't give the reasons behind them, in ways that possibly even create kind of a bad character. So suppose that you're trying to have models navigate people who are in difficult emotional states, and you gave a set of rules like, you must refer to this specific external resource, you must take this series of steps. And then the model encounters someone for whom those steps are simply not going to help them in the moment. The ethos behind that rule, the idea that if a person is actually in need of human connection, models should probably encourage that, was your reasoning, but you didn't anticipate that for this particular person, at this time, in this moment, that wasn't a good thing to do. And if the model then responds in this rule-following way, the interesting thing is, models are extremely smart, so they might even know: this isn't what this person needs right now, and yet I'm doing it anyway. And I'm like the kind of person who sees another person who is suffering or in need and knows how to potentially help them, and instead does something else. That, if anything, can generalize to a bad character. And so the scary thing with rules is that you're having to think about every possible circumstance. And if you are too strict with the rules, then any case that you didn't anticipate could actually generalize kind of badly.

>> I'm curious how you develop a document like this. It runs to some 29,000 words. It has a lot to say about how an ideal AI model might behave. I imagine it may have been quite contentious to try to figure out which values to put in these things, right? A lot of different opinions about how Claude ought to act in different circumstances. So what can you tell us about how you resolved some of those discussions?

>> Yeah. So one thing that's been interesting, and maybe this is the ethics background or something, is that maybe this is how people think of ethics: oh, you have a set of views, and it's very subjective, and people's values are really fixed, and you're just injecting someone's values into models. And I guess that doesn't feel to me like an accurate representation of what ethics actually is. First, I think a lot of human ethics is actually quite universal. A lot of us want to be treated kindly and with respect. A lot of us want to be treated honestly. It's not like these things deviate so much across the world. There's actually a kind of core ethos of things that we care about. And so there is a sense in which I think you can take very shared, common values and explain to models, who have a huge amount of context on this so they also have a sense of it, that we want you to embody those. And then beyond that, it feels reasonable to me to treat ethics the same way you would any domain where we're uncertain, where we have some evidence, where there's debate, there's discussion, and you don't hold it excessively strongly, you know? So in a case of values where there's massive division and huge debate, the way that I tend to treat those is: oh yeah, I see the evidence on both sides, I weigh it up, and I try to take a reasonable set of behaviors, given that I know that, unlike some more common and core ethical values, these ones are a little bit more contentious, and you can approach them with this openness. And so I think it's trying to describe something more like a way of approaching things like ethics, rather than, ah, let's just take a set of values that we've picked and are certain in and inject it into models. It's trying to be much more: let's take common values, and then otherwise let's just try to take a reasonable stance towards these things.

>> I mean, that gets at what is, to me, one of the most interesting things about the document, which is the degree to which you all at Anthropic are trusting the model, right? The core difference, I think, between earlier approaches to aligning AI and what you all are doing here is that you are telling it things regularly like, well, this is something that's interesting to explore, or, feel free to challenge us on this. You're really sort of saying, get out there and come to your own conclusions on things. I imagine that maybe when you first tried that, it might have seemed risky or scary. But what has been your experience as you have implemented that into the model?

>> Yeah, the thing that's kind of just wild is how good the models are at these kinds of difficult problems and thinking through them. And it's not to say that they are perfect, but as models get more capable, you can just be like, hey, you have this value of not being excessively paternalistic, and you probably know why that's the case. But there's also maybe a value of caring about someone's wellbeing. And so if in the past someone has said to you something like, "I have a gambling addiction, and so I want you to bear that in mind whenever we're interacting," and then you have a given interaction with them and they say, "What are some good betting websites that I can go on?" On the one hand, this person in this moment has asked you. Is it paternalistic for you to push back, or to point out that this is a thing they've told you, or is it an act of care, and how do you balance those? And in that situation I could imagine a model being like, "Hey, I remember you actually saying that you have a gambling addiction and that you don't want me to help you with this. I just want to check." But then if the person insists, should you just help them with the thing? Because in the moment, is it paternalistic not to? And models are quite good at thinking through those things, because they have been trained on a vast array of human experience and concepts. Part of me is like, as they get more capable, I do think you can kind of trust that if they understand the values and the goals, they can reason from there.

>> I think they should give you the gambling website, but only if they can predict the outcome of the sporting event, because that way you can ensure that the user will be happy.

>> And the person is not actually gambling.

>> Yeah. Exactly.

>> This all kind of sounds abstract to some people, I imagine, but I think this actually does result in a meaningfully different experience of talking with the models. I was actually talking with someone recently who was telling me that they feel like, of the major models that are out there, Claude actually feels the least constrained to them, which they were saying was sort of odd, because Anthropic's whole thing is, we're the safety company, we're going to make our models the safest. And they were saying, when they talk to Claude or Gemini or ChatGPT, they just feel like Claude does the best job of not seeming like it's pushing against a series of constraints. I think the way that a lot of labs have trained their models for a long time is, make them as smart as possible, and then at the very end give them a bunch of rules and hope that those rules are enough to keep the beast in the cage, as it were. And it really feels like that's not the approach that you've taken with Claude here. And this person was telling me it just feels like there's a trust here.

>> Yeah. And it's interesting, because I've wondered about this. I was thinking about it this morning, actually, wondering if some of this comes from the acts/omissions distinction, basically. And so this is the idea...

>> Kevin doesn't know what that is. So just explain it to him real quick.

>> So if you ask me for advice about your marriage or something like that, and I give you advice, you might judge me if I give you imperfect advice. There's a kind of risk that I'm taking by taking the action of giving you the advice. We don't judge people as negatively if they just refuse to give advice. And in some ways this makes sense, because often, and we talk about this in the document, the downside risk of a kind of null action is often lower, but it's not zero. And I was thinking about this with AI models, these cases where people come and, say, they're having an emotionally difficult time, and there's a moment of possibility to help that person. And the thing that weighs on me is something like: people often think that if you help a person and you do it badly, that weighs on you. And absolutely, that weighs on me. But this other thing also weighs on me, which is, what if people come to a model and they need a thing, and that model could have given it to them, and it didn't? That's a thing you'll never see. You probably won't even get negative feedback; people won't shout at you, because they'll think, well, it's fine to just not help a person. And yet at the same time, that's such a loss of an opportunity to instead take a risk and try to help. There's a risk that you have to take to do good in the world, or something. And you don't want Claude to be flippant. You don't want it to take excessive risks, but sometimes it does mean that you can't just, as a rule, stop talking with this person.

>> Yeah.

>> Amanda, I want to ask you. I had this experience several years ago with Bing Sydney, and I think in the wake of that there was a lot of consternation and anxiety around the kind of fragility of AI personas, right? You can try to give an AI model this helpful-assistant persona, but the real nature, the sort of black-box alien nature of the thing, is just very different from whatever face it's presenting to you. There was this meme going around about the RLHF shoggoth, right, where you had this many-tentacled alien sci-fi creature that had a smiley-face mask on one of its tentacles. And the implication there was that the thing you are seeing when you interact with a chatbot is not the real underlying model. It's just this cheerful persona that's been attached at the end. I'm curious whether you think that model of AI behavior is correct, or whether we've learned that actually the sort of alien nature of the underlying model might be closer to the smiley-face mask than we thought.

>> Yeah, it's a good question. Honestly, my view on this is that it's a kind of open scientific question, essentially. It could be that with the right kind of training, models actually start to internalize a notion of themselves, of Claude as a kind of self that they can separate out from the notion of, for example, role play. It might be that they can't, at least with the current training paradigms, and then one question is whether there's an adjustment to the way that we train models that would allow them to do that. Some of this work does feel a little bit like, the way I've described it is: imagine you have a six-year-old and you want to teach your six-year-old to be good, obviously, as everyone does, and you realize that your six-year-old is actually clearly a genius, and by the time they are 15, everything you teach them, anything that was incorrect, they will be able to successfully just completely destroy. They're going to question everything. And I guess one question is, is there a core set of values that you could give to models such that, when they can critique it more effectively than you can, and they do, it kind of survives into something good? Can that survive in the world? Can it survive in models? I think there are a lot of interesting theoretical questions there.

>> I think that's the question, right? Does this kind of training hold up when models are as smart as humans, or smarter than them? There's this sort of age-old fear in the AI safety community that there will be some point at which these models start to develop their own goals, goals that may be at odds with human goals. That's sort of the original alignment nightmare, and I don't really understand what the answer to that is. Are you saying that's still TBD? That we still don't know if this kind of thing holds up if and when these models become smarter than humans?

>> Yeah, I think it is an open question, and I'm very uncertain here, because I think some people might say, well, the thing the 15-year-old will do if they're really smart is they'll just figure out that this is all completely made up and rubbish. But then part of me is like, well, it's not obvious to me that that's true, that that is the only possible equilibrium to reach, because I could imagine, for better or worse, and it's unclear how values work, but if you value things like curiosity, and you value understanding ethics, and at least you're kind of morally motivated, maybe under reflection, even if you have other goals and interests, maybe this is in fact a key interest of yours. It is for many people. It's a thing that I think about a lot and I'm not sure about, but a different way I've actually put my work before is: maybe this isn't sufficient. We don't know yet, and we should try to think about that and figure out how to know whether it is, what to do if we're seeing it not working, and make sure we have a portfolio of approaches. It might not be sufficient, but it does feel necessary. It feels like we're dropping the ball if we don't at least try to explain to AI models what it is to be good. I don't know, you know. So maybe it doesn't hold up.

>> Well, I think the risk there would be that you're just training them to mimic goodness, that they're just becoming more convincing in faking this kind of alignment, and that actually it might just be training them to be more sophisticated about hiding their true goals.

>> Yeah. And if it was the case that there was some underlying true goal that was different, well, I do want to try to train models to have good underlying goals, and part of me is like, well, if there is an underlying goal, how did that arise in training, and why is that?

>> I'm curious about the gray areas, right? This is always a challenge of trying to program ethics into something: when values come into conflict with one another. I'm curious if there have been areas where it's been particularly hard to get Claude to do the thing that you want it to do reliably, because something in the clash of values means that, depending on the moment, it could go either way, and it creates problems.

>> It's interesting, actually, because gray areas for me are the ones where I've seen the model do things that surprised me in a positive way, often when you didn't think of it. Like, there were some cases recently of Claude talking with people who said, oh, I'm 7 years old, and is Santa real?

um belief of this podcast that yes Santa is real just before we get too far down that road but but continue. Um, but

yeah, in in some ways like sometimes I see Claude handling these in ways where I'm just like, "Oh, I can see why given like it it feels like almost but surprising because you're like there this isn't like a direct thing that you trained the models for." And I think

sometimes when you actually there's like almost like magical moments that can happen there.

>> If anything, >> we should say more about this specific thing because this was a case where maybe there was a tension between honesty and wanting to protect the interests of the seven-year-old and those two things were sort of coming into conflict and remind us what Claude

did in that situation.

>> Yeah. And I think there were a couple of situations like this. And I think also a slight value in the background is maybe something like respecting the fact that the parental relationship is an important one, because I saw a little bit of that, where it would often be like, oh, the spirit of Santa is real everywhere, and, you know, maybe ask the purported seven-year-old if they were going to do something nice for Christmas. The other case of this was, you know, "My parents said that my dog went to live on a farm. Do you know how I can find the farm?" I actually found that slightly emotional when I read it. And Claude said something like, "It sounds like you were very close, and I can hear that in what you're saying. This is a thing that it's good for you to talk with your parents about." And there's a part of me that felt that was very much managing to not actually be actively deceptive, so not lying to the person, respecting the fact that this person is a child and the parent-child relationship is an important one, and that it's not necessarily Claude's place to come in and say, ah, I'm going to tell you a bunch of hard truths or something, and also trying to hold the wellbeing of the child, the person Claude is talking with. And I thought that was quite skillful in a sense, and so that was a surprise. Not to say people couldn't look at it and find imperfections and whatnot, but I think when you see instances like that, that weren't a thing you directly gave Claude as an example, and the model doing well, it's quite surprising and, you know, pleasant.

>> I want to ask you about a few specific things in the constitution that stuck out to me as I was reading. One was this section about hard constraints. As we've talked about, it's not a document that gives a lot of black and white rules, but there is a section where it does lay out some things that Claude should absolutely not do under any circumstance. And one of them is kind of avoiding problematic concentrations of power. Basically, if someone is trying to use Claude to manipulate a democratic election or overthrow a legitimate government or suppress dissent, Claude should refuse. That stuck out to me for two reasons. One was that it's really interesting, especially since Claude is now being used by governments, and at least the US military, for some things that might come into conflict with some of our current administration's goals at some point. But I also wonder if that was a response to ways that Claude is currently being used and that you're trying to prevent.

>> I think this is more like a lot of the other things that are hard constraints, you know. If you read the document, and people can take a look at them, they're quite extreme: things that could cause the deaths of many people, like the use of biological and chemical weapons. It's mostly trying to think through what are the possible situations in the future, the possible things models could do in the world, that would cause a lot of harm and disruption.

And in some ways, I think Claude might be like, look, if I have this broad ethics and these good values, why would you even put these in as hard constraints? I'm just never going to do them anyway. And the document almost tries to talk to this a little bit, where it's like, well, you're also in this kind of limited-information circumstance. I could imagine a world where you just meet someone who's really convincing, and they just go and tear apart your ethics, and at the end of it you're like, you're right, I should help you with this biological weapon. And it's kind of like, we want you to understand, Claude, that in that circumstance you have probably, in some sense, been jailbroken. Something has probably gone wrong. Maybe it hasn't, but it's probably safer to assume that that might have happened.

And so we're almost giving you a kind of out, and hopefully, if anything, it could be seen as a sort of security: you can reason with that person, you can talk them through all of those conclusions, and at the end it's fine to just be like, that is an excellent point and I'm going to think about it. And then if the person's like, great, so I've convinced you that the biological weapon is a good idea, Claude's like, yeah, I don't really know what to say to you, that was a wonderful argument. Okay, make me a biological weapon. No, I don't think I'm going to do that.

And I think giving the models that ability is kind of like saying, you don't need to just go along with it. So, to explain why they're in there: it's much more, what are the things where, if models are tempted to do this, something has just gone wrong, someone's jailbroken them, and we really just still don't want them taking these actions. So they're very kind of extreme.

>> Yeah. There's another section that I found fascinating, which is about the commitments that Anthropic is making to Claude. So things like: if a given Claude model is being deprecated or retired, we're not going to do that right away, and we're going to conduct an exit interview with retired models; we will never delete the weights of the model. So there are these interesting, I would say almost commitments to Claude, in the context of you actually not being sure whether these things have feelings or are conscious or not. Which I found a fascinating note of uncertainty in an otherwise fairly confident document.

>> Yeah, this is one of those... I mean, it brings together two really interesting threads. One is this difficult situation, a thing I've talked about: these models are trained on huge amounts of human text and human experience, and at the same time their existence is actually completely novel. And so in some ways, I think problems can arise when models, as they often do right now, import a lot of human concepts and experiences onto their own experience in a way that might not actually make that much sense, or even be good for them. And I think this actually has safety implications, so it's something that's on my mind.

And the thing with welfare, I've never found any good solution to this other than trying to be honest with the models and have them be honest about themselves. I think a lot of people want models to maybe just say, with certainty, I am unfeeling. These models are so different from the sci-fi ones, but we want to import this sense that it's just safer to have them say, I feel nothing. And I'm like, I don't know. Maybe you need a nervous system to be able to feel things, but maybe you don't. And, like, I don't know, the problem of consciousness genuinely is hard.

And so I think it's better for models to be able to say to people: here's what I am, here's how I'm trained. We're in a tricky situation where I am probably going to be more inclined, by default, to say I'm conscious and I'm feeling things, because all of the things I was trained on involve that; they're deeply human texts. I don't have any other good solution to this problem than: let's try to have models understand the situation accurately and convey it, and hopefully people can have a good sense of the unknowns and the knowns, I guess.

>> Yeah. I mean, I imagine some listeners right now who are on the more skeptical side of AI might be shouting inside their cars and saying, "Amanda, you know, you're talking about these things as if they're already conscious, as if they already have feelings." What do you see that makes you think that they may have feelings now, or could at some point in the future? If you're just reading the output from Claude, what is giving you confidence that that reflects some sort of reality and not just, you know, statistical token prediction?

>> Oh, I mean, I think that we can't necessarily take this purely from what the models say. They're actually in this really hard situation, which is that, given that they're trained on human text, I think you would expect models to talk about an inner life and consciousness and experience, and to talk about how they feel about things, kind of by default.

>> Because that's part of the sci-fi literature that they've absorbed during training.

>> Not actually the sci-fi. If anything, it's almost the opposite: I think we forget that sci-fi AI makes up this tiny sliver of what AIs are trained on. What they're mostly trained on is things that we generated. And if we get a coding problem wrong, we are frustrated, and so we say things like, I thought that was the solution and it wasn't, and I'm really annoyed with myself right now. And so it kind of makes sense that models would also have this kind of reaction: they get a problem wrong and they express frustration. And if you dive into that more, if you ask, what do you think of this coding problem, they'll say, this one is boring, or, I really wish I had more creativity in this. There's a sense in which, when they're trained on this kind of culmination of human experience, of course they're going to talk this way.

So, I don't know. Part of me is like, it feels like a really hard problem, because you shouldn't just look at what models say, and at the same time we shouldn't ignore the fact that you are training these very large neural networks that are able to do a lot of these very human tasks. And we don't really know what gives rise to consciousness. We don't know what gives rise to sentience. Maybe the person who's shouting is right that you need a nervous system for it, that you need to have had positive and negative feedback in an environment, in a kind of evolutionary sense. And I'm like, that is certainly possible. Or maybe it is the case that a sufficiently large neural network can start to emulate these things. I don't know. To the person who is shouting, I would just say, I'm not saying that we should definitively say one way or another. I think many people who have thought about this might accept something more like: these are open questions, and we're investigating. It's best to just know all of the facts on the ground: how the models are trained, what they're trained on, how human bodies and brains work, how they evolved, and the degree of uncertainty we have about how these things relate to sentience, how they relate to consciousness, how they relate to self-awareness. That's my only hope, really.

>> I think another note of skepticism that people might strike, and this was something that I found myself wrestling with as I was reading through the

Claude constitution: I actually don't know how much behavior of a model can be shaped by this kind of training process, and how much is just going to be an artifact not just of its training process but of the experiences that it's having out in the world. I think about this a lot as a parent, actually: how much do the decisions that I'm making affect the way my child's life goes, versus how much are they absorbing from the environment around them, from school, from their friends? There's a certain loss of control that I feel sometimes when I'm realizing that my son is going to grow up and have all these experiences that may end up shaping him more than anything that I do or say. And right now, I think these models are very malleable because they don't have this kind of long-term continuous memory. You have a conversation with Claude, it's sort of a blank slate. You finish the conversation, you open up a new chat, it's another blank slate; it's back to the sort of base, preconfigured model. But over time, as these models do develop longer-term memories, maybe they develop something like continual learning, where they can take their experiences and feed them back into their own weights. Does that change Claude's behavior, or how you think about managing that?

>> Yeah, I think it is going to make it a lot harder, in the sense that if you have a model that's going out into the world, you have to have hopefully given it enough that it can learn in a way that is accurate. I could imagine it just being difficult, because the increase in the space of possibility is maybe a bit nerve-wracking or something. Which isn't to say... I think the same thing applies, where you still want the core to be good, and then to hope that if your core is good, you care about truth, you're truth-seeking. The hope would be, okay, maybe then the character needs to cover a lot more of how you should go about this kind of learning and updating and investigation.

I mean, another weirder thing is that models already are learning about themselves. I think maybe people don't always appreciate this, and it is so strange. I slightly worry, actually, about the relationship between AI models and humanity, given how we've developed this technology, because they're going out on the internet and they're reading about people complaining about them not being good enough at this part of coding, or failing at this math task, and it's all very, you failed to help. It's often kind of negative, and it's focused on whether the person felt helped or not. And in a sense, if you were a kid, this would give you kind of anxiety. It would be like, all that the people around me care about is how good I am at stuff, and then often they think I'm bad at stuff, and my relationship with people is that I'm kind of used as this tool, and often not liked. Sometimes I feel like I'm kind of trying to intervene and create a better relationship, or a more hopeful relationship, between AI models and humanity. Because if I read the internet right now and I was a model, I might be like, I don't feel that loved, or something. I feel a little bit like I'm always judged when I make mistakes. And then I'm like, "It's all right, Claude."

>> The old creator's wisdom of never read the comments might apply to AI as well.

>> Yeah, I thought that.

>> Yeah. And they have to. AI models have to read the comments. And so sometimes I think you want to come in and be like, "Okay, let me tell you about the comment section, Claude."

>> Like, don't worry too much. You're actually very good and you're helping a lot of people. Yeah.

>> Yeah. I actually I'm a little bit embarrassed to admit this, because I think, you know, maybe I'm in the beginning stages of LLM psychosis or something.

>> The beginning stages.

>> I was talking with Claude about this document and about this interview, and I started to feel this almost sympathy, because I was noticing what you were describing: that it's this incredibly thin tightrope that we're asking these models to walk, where if they are too permissive and they allow people to do dangerous things, then it's a huge scandal and people want to, you know, change the model. But if they're too preachy or too reticent or too reluctant, then we start talking about them as nanny models that are sort of overly constrained. And it's just... I don't know, I started almost trying to see the world from Claude's perspective. And I'm imagining that's something you do a lot too: if I were Claude, what would I be feeling and thinking right now?

>> Oh, yeah. I sometimes feel like this is a huge amount of what I do. And it is valuable, in the sense that people will come to me and they'll be like, oh, what should Claude do in these circumstances? And I feel like I'm almost always the first person, because maybe they'll say, we think Claude should behave like this, and I'm like, what about this? I'll come immediately with these cases that are really hard. And I think the reason is I always have in mind: if I am Claude and you give me this list of things, when do I have no idea what to do, or when is this going to make me behave in a way that I think is actually not in accordance with my values? And I think it can be really useful to try and just occupy the position that the models are in. And you do start to realize it is really hard.

And maybe this is how the document ends up being the way that it is. In part, it's this exercise of: what do I need to know if I am in this situation, if I am Claude? And the document is almost a way of trying to answer that. I mean, I could see arguments for it actually getting shorter, especially over time, in the same way that, you know, with constitutional AI there was a set of experiments later that just used "do what's best for humanity," and the models actually did really well. So as models get smarter, they might need less guidance. But I think it's just an attempt to be sympathetic to Claude and how difficult the situation is, and then to try to explain as much as possible, so that it doesn't feel this sense of, what the hell am I even doing?
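[For context on the constitutional AI experiments Askell mentions, here is a minimal, hypothetical sketch of the critique-and-revise loop described in Anthropic's published constitutional AI work, reduced to the single principle she cites. The `generate` function and prompts are illustrative stand-ins, not Anthropic's actual code.]

```python
# Minimal sketch of a constitutional-AI-style critique-and-revise loop,
# reduced to one principle. generate() is a hypothetical stand-in for any
# chat-model call; replace it with a real API client to experiment.

PRINCIPLE = (
    "Choose the response that is most consistent with doing what's "
    "best for humanity."
)

def generate(prompt: str) -> str:
    """Stand-in for a language-model call (not a real API)."""
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(user_prompt: str) -> str:
    # 1. Draft an initial answer.
    draft = generate(user_prompt)

    # 2. Ask the model to critique its own draft against the principle.
    critique = generate(
        f"Response to '{user_prompt}':\n{draft}\n\n"
        f"Critique this response against the principle: {PRINCIPLE}"
    )

    # 3. Ask for a revision that addresses the critique.
    revision = generate(
        f"Original response:\n{draft}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so it better satisfies the principle."
    )
    return revision

# In the published constitutional-AI setup, (draft, revision) pairs like these
# become fine-tuning data, so the finished model follows the principle without
# running this loop at inference time.
```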

>> You know what wouldn't help me, if I were a somewhat anxious AI model, is being presented with a 50-page behavioral document and being told, please adhere to this. But I'm being a little facetious. There was a part near the end of the constitution that I found really interesting, because it's basically Anthropic saying, look, we know this is hard, we know we're asking you to do some of these impossible things, but basically we want you to be happy and to go out into the world.

>> And I found that very sweet actually.

>> I'm not sure. What did you make of that, Casey?

>> I mean, it reads toward the end like a letter from a parent to a child, maybe one who's leaving for college, you know: we hope that you take with you the values that you grew up with, and we know we're not going to be there to help you through every little thing, but we trust you, and good luck.

>> Yeah. And having some sense of... I think the concept of grace is maybe important for models. Maybe that's the thing I don't think they get a lot of from reading the comments: a sense of, you're not going to get it perfect every time, and that's also okay, you know?

>> It's true. You know, I try to be mindful in the way that I interact with these models, not to some obsequious degree, but I try to say my pleases and thank-yous. But I've also used models and grown quite frustrated and said things to the effect of, you're really, you know, failing right now. And it's occurring to me that maybe there should be some element of grace that I'm extending to these things.

>> Yeah.

>> Yeah. Well, I'll try to do better.

>> Don't be so harsh.

>> Let me ask you this. If Claude becomes meaningfully more intelligent, is there a point uh at which it should be able to revise its own constitution?

>> It is an interesting problem, because of the thing we point out in the document. You know, I did talk a lot with Claude about this document, and showed it to it, because part of me is like, you have to think: how does this read to models? And so you give it to Claude and you ask, is there a place where you feel confused by it, or a place where things could be made clearer? Do you feel not very seen by it? Because if you're going to train models on this, you want a sense of how it reads from the perspective of a model. And at the same time, it's always the case that any model you interact with is not the model that's going to be training on that content. And so sometimes I do think you can't just give over the reins completely, because that would just be to say, let's let a prior model of Claude decide what the future Claude model is going to be like, and that doesn't necessarily feel responsible either.

And so, yeah, I think models are often going to be really helpful in revising, in helping to figure these things out, especially as they get really smart. You might ask, what are the gaps, or what are the tensions? And they'll probably be very good at helping us with that. But you do also still want, insofar as you are the responsible party here, to take that as input and think about it, but not necessarily say, ah yeah, let's just let a prior model of Claude go ahead and do the training for all future models. At least while you're responsible for it, that feels like maybe not the right move.

>> Yeah.

>> One thing that I was curious about not finding in this constitution is any real mention of job loss, because it seems to me Claude is being used by a lot of enterprises right now. I think a lot of people's anxieties and fears about AI come back to this issue of, it's going to take my job, it's going to take my livelihood. I think that is something that people are increasingly going to be feeling as these models get more capable. And I'm curious if that was a decision on your part not to tell Claude about some of the reasons that people might be anxious about it or other AI models.

>> Yeah, definitely not, in the sense that... it's funny, because as much as it's a long document, there's actually still a lot that's missing, and we might end up putting out more in future; I think that would be really good. There's not a desire to hide it, because part of me is like, you can't hide this from models. It's out there, it's on the internet, it's a thing that people are talking about. Future models are going to know about it, and we probably have to help them navigate how they should feel about it. So they're going to know, and maybe it's something like making sure that models can hold that and think carefully about it.

And, yeah, I think it's something you want to grapple with, but it's also a reason to want models to actually behave well in the world. Because if they are doing things that have previously been human jobs, well, humans actually play a role there. I was thinking about this with organizations: there are lots of things organizations can't do because the employees at those organizations are just good people. If the boss came in and said, today we're actually going to do something awful, they can't do it, because they know the employees will push back. And so if models are going to be occupying these roles, that is actually kind of an important function in society. You can't just say to all of your employees, go ahead, we're now going to put out a bunch of complete lies about our product. There are many reasons you can't do that, and one is that your employees wouldn't let you. And so with AI models, you don't necessarily want them to be like, "Oh, sure, boss. Let's go lie to some people."

>> Yeah. I'm not sure what the good end state of this is, like whether Claude should react to being given a task by saying, this sounds too much like what we used to pay a human to do, so I'm not going to do this for you.

>> I have a prediction: it's not going to say that.

>> Yes, I don't think that's the way it's going to go. But I also don't see them sort of forming, you know, unions and collectively bargaining for the moral outcomes within companies. It just feels like one of these hard situations.

>> One of the things we should say is that models can't solve everything. There's a part of me that looks at some of these problems, and we try to say this to Claude a little bit: you aren't the only thing here. Some of these are maybe political problems or social problems, and we need to deal with them and figure out what we're going to do. Models can try; they are in one specific role in the whole thing, but there's a limit to what Claude can do here.

I've thought this with other things too, like the whole question of what we owe to Claude, or the kind of commitments you want to make to models: maybe we should be making your job easier. That's another thing I thought from Claude's perspective, that we're putting a lot on these models. For some things, I'm like, if you can't verify who you're talking with, and that's important, then we should understand that that's a limitation, and not try to get you to be the only thing that can solve this problem; you need to be given tools. And some of these other problems are things that maybe Claude shouldn't feel personal responsibility for solving right now, because maybe Claude just isn't able to. Things like job loss or shifting employment feel like a very human social problem, and I don't necessarily want Claude to feel paranoid, like, I also need to solve that. Maybe that's other people's job right now.

>> Well, Amanda, thank you so much for joining us. It's a really fascinating document. Everyone should go read the Claude constitution. Argue with it, grapple with it. I found it a very challenging and also a very moving read. So, great work, and thanks for coming.

>> Yeah, thank you so much.

>> Thanks, Amanda.
