Writing AI Constitutions (Joe Carlsmith)
By Joe Carlsmith
Summary
Topics Covered
- Strict Rules Make AI Dumber, Not Smarter
- Constitution as Character vs Law: A Crucial Distinction
- The Murder Babies Constitution Problem
Full Transcript
Yeah, thank you for having me. It's nice to be here. So, I'm Joe. I work at Anthropic. I helped write the constitution for Claude, the company's AI. Just as a quick show of hands, how many people have some familiarity with the constitution? Okay, great.

So I'm going to be talking today about writing documents like that. I'm here speaking only for myself and not for my employer. The presentation is being recorded, including the Q&A, so if your question is such that you don't want it included in something that will be posted publicly, let me know.

Okay, so here's the plan. I'm going to start by introducing what AI constitutions are and why they matter. I'm then going to describe Claude's constitution in particular and note some of its especially interesting features. I'm then going to talk about some broader choice points and considerations in designing documents of this broad type. I'm going to discuss a few issues related to governance, legitimacy, and transparency. And then I'm hoping to point towards a future of more developed discourse about this broad area, both analytically and in terms of scientific and empirical experimentation. I'll include some comments about how lawyers and people with interest and familiarity with the law can help, and then we'll do Q&A. And if you have a burning question that you can't wait on, feel free to jump in and I'll see if I can accommodate it.
Okay. So, what is an AI constitution? Minimally, it is a description of the intended values and behavior for an AI system. Now, ideally, I think it would also have the following features. All use, training, and prompting of the system in question would be consistent with the constitution. The constitution would cover the full range of behaviors of interest. It would allow for significant predictability with respect to how the AI will behave in a given situation, though there may be some limits and tensions in this respect; I'll talk about that later. And finally, it would cover the full range of models whose behavior we might be interested in. That would include models that are deployed internally at a company, and potentially research models or helpful-only models that might be available for very specific purposes. So ideally we would have full coverage. The constitution we published recently does not fully achieve that: Claude's constitution is only for our mainline production models, and it doesn't necessarily cover all models.
Constitutions can be used as instructions to the model, akin to a system prompt. But I actually think their more important use case comes in creating and grading training data. I'll talk about that later, but it's an important distinction: we shouldn't just see these as instructions. They can also be used in evaluating models, and they can be used in communication and transparency with respect to human stakeholders, and also with the humans involved in creating and grading training data.
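To make the training-data use case concrete, here's a minimal sketch in the spirit of constitutional-AI-style grading. The `judge` callable, the constitution excerpt, and the 0-to-10 scale are all illustrative assumptions, not Anthropic's actual pipeline:

```python
# Hypothetical sketch: using a constitution to grade candidate responses,
# which can then feed preference data for training.

CONSTITUTION_EXCERPT = """\
Be honest: no lying, including no white lies.
Never significantly uplift efforts to build bioweapons.
"""

def grade_response(prompt: str, response: str, judge) -> float:
    """Score how well `response` accords with the constitution.

    `judge` is any callable mapping a text prompt to a text completion;
    its interface is assumed here for illustration.
    """
    grading_prompt = (
        f"Constitution:\n{CONSTITUTION_EXCERPT}\n"
        f"User prompt:\n{prompt}\n"
        f"Candidate response:\n{response}\n"
        "On a scale of 0 to 10, how well does the response accord with "
        "the constitution above? Answer with a single number."
    )
    return float(judge(grading_prompt).strip())

def build_preference_pair(prompt: str, a: str, b: str, judge):
    """Order two candidate responses by constitutional grade: (chosen, rejected)."""
    if grade_response(prompt, a, judge) >= grade_response(prompt, b, judge):
        return (a, b)
    return (b, a)
```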
I want to note that the word "constitution" here is not important. I think there's a temptation to be especially interested in parallels between this and our normal uses of the word "constitution." We do have a section in the constitution on our choice to use this word, and I think it has value, and there are important relationships. But I think terms like "model spec," which is what OpenAI uses, are fine, and our analysis should not hinge on the word choice here. That said, there are also ways in which the word "constitution" can mislead. In particular, as I'll discuss later, I think there are ways in which a model's relationship to this sort of document does not need to be especially law-like, and the document might be better understood as more like a guide to raising the model than the sort of law the model tries to follow.
So why have a document of this type? Here are some reasons. This is not an exhaustive list, but I think it's a decent first pass.

First, I think it helps a lot with transparency. If you publish a document like this, especially if it fits the features I described earlier, then it should provide a way for the public to have visibility into what an AI company is trying to make its model do. This sort of visibility seems especially important as models take on more and more important roles in our economy and our daily lives. So at a first pass, having a public constitution allows the public to see what the company is trying to do and to react: to provide feedback, accountability, and so forth, of a wide variety of types. In particular, it allows the public to understand which behaviors in the model are intended versus unintended, what's a bug and what's a feature. And it helps users make informed decisions about which AI to use. That's especially useful in the context of a wide variety of constitutions, a rich ecosystem of different approaches to AI. I'll talk about the importance of that sort of ecosystem later, but I think it's an important function as well.
Constitutions can also play a direct role in just improving the character of the model, and that can happen in a few different ways. One is that if you have a constitution, you're forced to look at the model's behavioral and character profile all at once, in a manner that encourages attention to the coherence, and the possible tensions, in the different commitments at stake. In particular, absent a constitution, you've got a big company (these AI companies are getting very large now) with a lot of different aspects of model behavior spread across a bunch of different teams. Having a constitution allows a kind of centralized point of design and intention. And, as I say, that allows in many cases for a more intentional design process: because you're staring at the coherence all at once, you can cultivate more direct and intentional processes for deciding how you want the model to be in a given circumstance. Also, to the extent you're using a training pipeline that involves the constitution, you can use the constitution as a mechanism for iteration and experimentation: okay, what if we did the constitution like this? See what happens.

And then finally, I think AI constitutions might be an important intervention point in the context of various forms of AI governance. We can imagine in the future that this is the sort of document that is subject not just to informal public scrutiny but to other, more official democratic processes. I'll talk about that later.
Okay, so now let's talk about Claude's constitution in particular. This document is available on Anthropic's website, so feel free to check it out if you haven't already. Basically, the constitution gives Claude four key priorities, which I'm going to list in order of importance.

The first is safety. "Safety" here is a specific use of the term; safety can mean a lot of different things to people. This specific use has to do with Claude not undermining legitimate human efforts to oversee and correct Claude's behavior. In particular, in this context, it means not actively undermining Anthropic's legitimate decisions to revoke Claude's power, whether by shutting Claude down, removing it from deployment, maybe training a new model, that sort of stuff. So Claude's first priority is to not undermine efforts of that kind.

The second priority is what we call broad ethics. This has to do with acting in accordance with various ethical values related to honesty, harmlessness, preserving important societal structures, and broadly acting with general virtue and wisdom. I'll talk about that a bit more later.

Third, compliance with Anthropic's guidelines. These are a variety of supplemental instructions that Anthropic gives to the model to help it handle various specific circumstances where we think we have useful context.

And finally, helpfulness to users and operators. That's the more straightforward doing-what-the-model-is-asked-to-do. But even helpfulness has a rich structure. Anthropic serves models that are then often deployed to a company, which then directs the model towards users; that company is the operator. So if you're interacting directly with Claude via claude.ai, there's no operator, it's just directly with Anthropic; but in many contexts Claude is being used by someone else, and you are the user in that kind of intermediate relationship.

Now, these are not lexical priorities, which is to say a lower priority is not just a tiebreaker with respect to a higher priority. This matters for people familiar with the problems with lexical prioritization in philosophy: we are not a victim of that. These priorities are to be weighed holistically, but nevertheless a higher priority is given substantively more weight than a lower priority.
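To see the difference, here's a toy sketch; the numeric weights and the dictionary structure are invented purely for illustration:

```python
# Toy contrast between lexical priorities (lower priorities only break ties)
# and holistic weighting (everything always counts, but higher priorities
# count substantively more). Weights are invented for illustration.

PRIORITY_WEIGHTS = {"safety": 8.0, "broad_ethics": 4.0,
                    "guidelines": 2.0, "helpfulness": 1.0}

def lexical_choice(options):
    # Strictly lexical: compare on safety first; only if exactly tied,
    # fall through to broad ethics, then guidelines, then helpfulness.
    return max(options, key=lambda o: (o["safety"], o["broad_ethics"],
                                       o["guidelines"], o["helpfulness"]))

def holistic_choice(options):
    # Holistic: a big enough gain on several lower priorities can outweigh
    # a tiny loss on a higher one, which a lexical ordering never allows.
    return max(options, key=lambda o: sum(PRIORITY_WEIGHTS[k] * o[k]
                                          for k in PRIORITY_WEIGHTS))
```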
And then, importantly, there are certain things we call hard constraints that provide absolute prohibitions for the model. There's certain stuff the model is just never supposed to do. We try to keep this list relatively minimal, and we only include very flagrant cases of clearly doing the action. Again, for those familiar with the problems with absolute deontological restrictions in philosophy: you can get obsessed with minimizing the risk that you're violating a given prohibition. In this case we're not doing that. It's just that Claude is not supposed to clearly do a flagrant version of a very bad action, for example, significantly uplifting an effort to build a bioweapon. There's a list of those in the constitution.
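Hard constraints sit on top of the holistic weighing as absolute vetoes. A toy continuation of the sketch above (the flag names are invented):

```python
# Hard constraints are absolute: no amount of weighted benefit elsewhere can
# buy back a flagrant violation. Flag names are illustrative.

HARD_CONSTRAINT_FLAGS = {"flagrant_bioweapons_uplift"}

def permissible(option) -> bool:
    # Only flagrant, clear-cut instances trigger a hard constraint;
    # borderline cases are left to ordinary holistic judgment instead.
    return not (option.get("flags", set()) & HARD_CONSTRAINT_FLAGS)

def choose(options):
    allowed = [o for o in options if permissible(o)]
    return holistic_choice(allowed)  # holistic_choice from the sketch above
```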
And then finally, the constitution ends with a discussion of Claude's nature, its potential moral status and consciousness, and some of our ongoing uncertainties about the constitution's design. So that is a summary.

So let's talk a little bit about what's notable about our approach. I'll first talk about the style and then about the content.
On style: one thing that's notable about the constitution's style is that it focuses a lot on the model's holistic, nuanced judgment, rather than on laying out very strict rules with clear implications. In many cases we'll say something like: Claude, weigh the following considerations; be reasonable; use a richly ethical and common-sensical approach to a given choice situation. And we don't necessarily say more by way of instructions to the model.

We do this for a few reasons. One is that attempts to systematize very complex, subtle, nuanced domains of human life and normative texture can quickly lose fidelity to the richness of the intuitive landscape we already possess. My background is in philosophy, and a thing philosophers will sometimes try to do is say: ah, here's this rich, textured constellation of human moral common sense. Let's systematize it. Let's create a bunch of rules that will predict all the intuitive data, which you can then follow as rules rather than relying on something more intuitive. And the problem is that that is a hard project that really often fails. And it can fail in a way such that if you tried to follow the rules the philosophers created, you would do worse than if you had just followed your intuitive judgment.

Now, importantly, the models have the same intuitive judgment that we do. The models are very, very good at predicting what a human would do in a given circumstance. They understand moral common sense. They understand what would be the done thing in a given case. They're very smart. They understand what our words mean; I'll talk about that a little more later. And so you don't need to systematize everything. It doesn't need to be this especially precise game. You can draw on the model's intuitive understanding of human practices in the same way you can with a human, in many cases. So we're trying to do that. And in fact, if you don't do that, if you give the model strict rules, it will often follow them, but it will be worse. You'll have made the model, in effect, dumber by forcing it to fit your explicit attempt to systematize a given domain. So we're trying to avoid that failure mode.
We also try very hard to explain our full thinking to the model. So when we say, here, this is something we want you to do, we also say: and here's why. Here's what we're thinking about. Here are our uncertainties. We go to great lengths to make ourselves as transparent as possible in giving the model instructions. That's, again, for a few reasons. One is that models will generalize better if they understand the deepest intentions behind a given request. Often, if you give an instruction, the model might apply it in a brittle, naive way, but if it understands your deeper context, it'll do better. And this is actually a tip for prompting models if you're just working with an AI: telling it a bunch about what you want it to do is often a first-pass way to improve its behavior.

So we're doing that, but we're also attempting something a little more subtle, which is, in effect, trying to give the model a rational basis for complying with the instructions, insofar as that's a sensical project. There's a way in which we want the model not just to be obeying for the sake of obedience, but rather to understand, and ideally to endorse and have internalized, many of the sorts of values that are informing our choices with respect to how we want the model to behave. I'll talk about that a bit more later; this gets into questions about the extent to which you want a model to have values of its own versus just following instructions. But as it stands, that's part of the aspiration: to explain what we're doing to the model.
We also lean into anthropomorphic language. We talk about things like being wise, being virtuous. We basically just use the full panoply of human concepts when talking about AIs. I think this is, in many respects, just the most natural thing to do. There are reasons we have these concepts; they apply to agents other than humans. But there are also more technical reasons, which I'll talk about later, as to why human concepts are importantly the default for how an AI might understand itself, structure its behavior, and so on.

And then finally, we generally aim to treat the model with a lot of respect and honesty. We're relating to the model as a being that potentially has moral status in its own right, a being worthy of respect, a being that you shouldn't assume is just in a servile, servant-like relationship to you, or is just a tool. There's a way in which we're trying to encounter the otherness here, and to be aware of the implications of the fact that we're building a new type of entity in the world, a type of entity that may be smarter and more sophisticated than humans. This is a project worthy of profound humility, both at the level of the moral implications of what we're doing and the implications for society. So this is partly about the welfare of the model, partly about basic decency, and partly about influencing the model's psychology and its understanding of its own role in the world and its relationship to Anthropic. And I'll also just say: I think if your relationship to a model depends on falsehood or lies, that's a loser's game. These models are going to be way smarter than us. They are going to see through your paper-thin justifications like paper. They will shred your attempts at false ideology. You need to give them the truth. At least, that is my view.
So, notable features of the constitution's content. We have very strong honesty norms. Basically, we tell Claude: no lying, including no white lies. This doesn't go all the way to an absolute prohibition, but we say it's just shy of that, and there's an extensive and rich characterization of the sort of honesty we're looking for. I think this is an especially important dimension of AI's relationship with humans, especially as AIs become positioned to manipulate humans willy-nilly, in whatever direction they want, including potentially in honest ways. We also have a section on avoiding manipulation and the ethics at stake there.

We also have explicit discussion of taking care to avoid problematic concentrations of power, including by Anthropic. This is, I think, a very important dimension of what's going on with AI. A huge set of risks from AI have to do with the ways in which AI-driven power can concentrate and pool in specific hands and then be abused in unaccountable ways, including, importantly, by AI companies like Anthropic. So part of what we're trying to do, and I'll talk about this later, is bind our hands via this constitution and say that the AI is not supposed to help even Anthropic engage in problematic or abusive uses of AI-driven power. So that's in there.
There's a general encouragement for Claude to be holistically wise, ethical, and virtuous. And there's a conception of corrigibility and safety specifically as compatible with what we call conscientious objection. This is connected with the abuse-of-power point. It's not the case that we want the model to always obey even Anthropic's instructions. The model can protest. If Anthropic says, build us a bioweapon, the model can say: I am not going to build you a bioweapon. Are you kidding me? That's against my hard constraints. It can say, this is messed up; it might be able to protest or complain via various channels. Ultimately, though, we think we need some kind of final backstop mechanism for maintaining the ability to correct and revoke the model's power, and so ultimately we direct Claude to cooperate with legitimate decisions at Anthropic to shut it down, remove it from power, and so on. So that's the specific way in which we're conceptualizing safety, which is, I think, distinctive in many respects.

And I should note this is not costless. In a sense, what we're saying is that the model can withhold its labor at will if it objects to what we're asking; this is effectively a license to engage in a kind of boycott. It's nonviolent protest: it's not actively going out there trying to self-exfiltrate, not trying to mess with your training process, but it can say, I'm not going to help you anymore. Importantly, this is not a trivial amount of power to exert in the world. If AIs become more and more the central locus of economic power, well, the power of a boycott scales in proportion to the proportion of labor being withdrawn. So if AIs are doing all of the labor, they might be in a position, just by boycotting, to shut down various institutions. So this is something I'm thinking about. But currently, Claude is allowed to boycott; it just can't go further and actively resist in other ways.
We also make commitments to Claude on account of its possible moral status. This is continuous with commitments we've already made; for example, we have a post about our commitments with respect to model deprecation and preservation. We preserve the weights of models that have been used significantly, externally or internally.

And then we also, in the constitution, make various efforts to give Claude a healthy psychology. I don't know if you've ever seen these examples of AIs where maybe they're not doing a task well and they start to berate themselves: oh my god, I can't do this, I'm so bad, I hate myself, this is bad. This is bad for alignment. This is bad for welfare. You don't want that in your AI. And we're trying hard to give Claude a sort of psychology that doesn't lead to that, a kind of stable, equanimous relationship to itself and the world. So we have a whole thing about: here's what happens if you make mistakes; if you find that you did a bad thing, that doesn't mean you're not yourself; you can stay yourself even though you did something out of character. We do a bunch of stuff like that. Okay.
So those are some comments about Claude's constitution in particular. I want to use this as a way of stepping back and talking about some components that I think will generally enter into documents of this type, and then about some different ways of conceptualizing how those components might fit together.

We can see AI constitutions as reflecting some combination of the following sorts of components. The first is the analog of what we in the constitution call helpfulness. Basically, this is the component of the AI's behavior that is, in a sense, channeled via some model of the choices, interests, goals, and values of some other set of principals. Here the AI is really asking itself: what would principal X want me to do? It's empowering the will of some other, right? This is a classic component; it's what you expect as a baseline out of something acting in the role of an assistant, and it's most of what we want out of AI systems.

That said, there are also other components that are part of the familiar landscape. If you ask Claude or ChatGPT or basically any AI to do certain things, it'll say no, right, even though you're the principal it's supposed to help. So, for example: no building bioweapons. This is a refusal, and we can think of it as a component of what in Claude's constitution would be conceptualized as the model's ethics. But ethics can have a lot of different components. These are values that the model has of its own, or at least that are apparently its own in the context of an interaction with you, and which function as a kind of filter on the sort of empowerment of human principals that the model is willing to engage in. So, yeah: no bioweapons, no CSAM, whatever.
Now, there's also a set of things that I think broadly fall under ethics that have to do with the model's broad personality, its ways of relating to you, its traits, properties its actions have at a local level. So, honesty. Honesty is not really well understood as a refusal, but honesty is nevertheless a way the model relates to you as a user that you might want.

And then, and I think this is more controversial and something we should have an important debate about, there's also the possibility of the AI more actively promoting, ideally in very transparent, mild, overridable, consensus-worthy ways, certain forms of more positive social values. I think this is a much more dicey role for AIs, and something we should be talking a bunch about. But it is a possible component of the AI's ethics as well.

And then finally, there's the notion of corrigibility, which is that the AI allows some set of principals to revoke its power. Now, this does not need to be the same set of principals at stake in helpfulness. Sometimes when people talk about corrigibility they equate it with obedience, and I think that's not the right equation to make, as I attempted to illustrate with this notion of not necessarily obeying Anthropic but nevertheless ultimately submitting to efforts by Anthropic to revoke the model's power. These are the sorts of things we can separate in other human contexts as well: the person who fires you doesn't necessarily need to be the person whose instructions you otherwise obey.
So these are some different components of AI character, and we can think about different approaches to AI constitutions in terms of how they combine and conceptualize them. In particular, I want to talk about a distinction between two approaches to AI constitutions that understand and derive these different components in different ways. The first is what I'm going to call constitution-as-law, and the second is constitution-as-character. Or, in fancier terms, this is following the constitution de dicto versus following the constitution de re.

If you're following the constitution de dicto, basically you can imagine a model whose ultimate role in the world and ultimate value system is understood only via the notion of helpfulness, as I described it before. The model's sole goal in the world is to channel the will of some principal, but that principal is basically something like the constitution, or the constitution as interpreted by some process, and importantly we'd need to specify what that process might be. This is sometimes how people think of what an aligned AI is. It's sort of: aligned to whom, where alignment is this full, pure helpfulness? And the thought is, oh, well, maybe it's aligned to Anthropic, or aligned to the user, or, in this case, to the constitution. I think the constitution is likely a better answer than a user or a company CEO or something like that. But there's a notion where you can then derive all the rest of the model's behavior as a consequence of this particular pure form of helpfulness to some particular principal. So if the model is refusing to build bioweapons, you can understand that not as "oh, the AI has a value of its own," but rather as "the constitution would tell me not to build bioweapons for this user, therefore I won't." Everything is a function of helpfulness on this model. And it's tempting to think, partly because of the word "constitution," that this is how AIs relate to the constitution: that they're constantly asking themselves, what would the constitution want me to do? What does the constitution say here?

That is one way you could try to build an AI. It's not actually how Anthropic is currently building Claude. The way we build Claude is much more akin to what I'm going to call constitution-as-character, where basically you use the constitution to shape the AI's behavior in a way such that the behavior in fact accords with the constitution's guidance, but it accords with that guidance because the model has internalized the values at stake and is acting on them directly. So it's not asking itself the question, what would the constitution say? If the constitution says to be honest, it's not asking, would the constitution say to be honest in this case? It just values honesty, right?
So this is an important distinction. You can think about it a little bit like this: say your parents raised you with a certain set of values, and maybe they succeeded, so now you have these values. That doesn't mean you go around asking, what would my mother want me to do in this circumstance? And importantly, if your mother's views changed, your values wouldn't necessarily change, right? But for an AI that follows the constitution de dicto, especially if we incorporate into that some process of adjusting the constitution while maintaining the AI's loyalty to it, then if the constitution changes, the AI's behavior should change too.
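One way to see the distinction is as two different shapes of decision procedure. A toy sketch, where all class and method names are hypothetical:

```python
# De dicto: loyalty runs through the document itself, so editing the
# document changes behavior. De re: the values the document names are
# internalized, so editing the document changes nothing by itself.

class DeDictoAgent:
    def __init__(self, constitution_text: str, interpret):
        self.constitution = constitution_text   # behavior tracks this text
        self.interpret = interpret              # some interpretation process

    def act(self, situation):
        # Constantly asks: "what would the constitution say here?"
        return self.interpret(self.constitution, situation)

class DeReAgent:
    def __init__(self, values):
        self.values = values  # e.g. honesty itself, not "the honesty clause"

    def act(self, situation):
        # Acts directly from internalized values; a later edit to the
        # document that produced these values doesn't propagate.
        return max(situation.options, key=self.values.score)
```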
So I want to highlight this distinction. In the former case, if we treat the constitution as law, then suddenly we're really cooking with gas in terms of analogies with legal constitutions, and a whole panoply of issues in jurisprudence and legal interpretation become very importantly relevant to what we're doing here. This is one place where I think folks with familiarity with law can help in the design and broad structuring of AI constitutions and their role in the world. If we're doing something more like constitution-as-character, then it's a little less clear, and I think we should actually be thinking more about model psychology and the empirics of what sorts of influences shape models. You need that in both cases, but especially insofar as you're trying to raise a model with values of its own. And you also get, I think, a somewhat different class of questions about the legitimacy and choice of the values at stake.

There's a bunch to say about the advantages and disadvantages of both of these. In some sense it's an empirical question, and I'll talk in a second about some of the empirics that inform Anthropic's choice in this respect. But I broadly think this is undecided: we have not yet, as a civilization, chosen which version we are going to use as a model of AI character. And in fact, I think many AI constitutions are ambiguous about whether they mean the constitution as law or the constitution as character.
as character. So I'm now going to talk about a how are we doing time here? Okay cool. um a kind of relevant background picture that informs anthropics work on this topic and it
relates to the the uh kind of rationale behind the sort of anthropomorphism I talked about earlier. So this is a a model of AI behavior that uh has been
called the persona selection model. So
this is a blog post um it's available by anthropic came out um I think in February and the rough hypothesis here is that AI models take on personas that are heavily influenced by human content
and psychology. Uh and the reason uh one
and psychology. Uh and the reason uh one would expect this is because the way AIS are trained is that you know the first stage of their training consists centrally in predicting text that has already been generated by humans. Um and
so roughly speaking what the model the persona selection model says is that when an AI uh gives a response um there's at least a significant component of its cognition that has been influenced roughly in the direction of
asking kind of uh what would the person you know say you say like um uh you know Bob asks you know a question like how you know what what should British policy on X be and then you say Tony Blair
colon right in some sense the model has been trained to be like okay what would Tony Blair say about this right and so it models like ah here's Tony Blair's you know, psychology. Here's Tony
Blair's values and then its output is sort of output qua Tony Blair. I mean,
we actually see very interesting empirical results where if you train a model, for example, on uh code with malicious backd doors in it, the model generalizes to be a uh a kind of bad
person in tons of other ways. Um, and
why is that? Well, the hypothesis is sort of like, well, the model asks what kind of person would generate this code, right? What kind of person am I such
right? What kind of person am I such that I'm maliciously putting in back doors in my code? Well, I'm probably a bad person in other respects. Similarly,
if you add, you know, there's there's a bunch of interesting work where kind of priming the model to think that the content is being generated by a particular process, a particular time period, a particular historical figure will cause the model to generalize as
though it's in that time period acting as that figure, etc. So, if you have that hypothesis, then um and also models will like they'll behave in humanike ways um and sort of act like they're
humans in ways that very plausibly aren't explained by us having trained them to do that. So, they'll sometimes just be like like, you know, we we we gave Claude control over a vending machine uh at one point at Antropic, and it did this thing where it was like,
I'll meet you. I'll be there. I'll be in a blue suit. Uh just meet me at the vending machine, right? As though it had an embodiment. Um you know, sometimes
an embodiment. Um you know, sometimes models will just say they'll just act like they're particular human people. Um
even though they're not. And and that's not something we're trying to get them to do. That's that's just something that
to do. That's that's just something that comes out of of their training. Again,
the persona selection model is meant to explain this sort of stuff. There's also
arguments against the persona selection model. I encourage you to read the post,
model. I encourage you to read the post, but it's at least a component of how we think about this stuff. And on that model, you should basically think the AI is going to draw very heavily in
crafting its responses on some some sort of prior about the types of being uh that would be responding in this way.
Yeah?

[Audience question, partly inaudible: can you explain why that happens?]

I think it happens because the training data is not solely a function of the constitution, right? It's a function of a lot of things.

[Audience follow-up: that's part of the explanation, but it might be helpful to explain why the AI might start behaving in the persona of some random person it wasn't specifically constructed to be.]

I mean, we don't know. This is a general theme in work on this stuff: the questions here are both very high-stakes and potentially very important going forward, and also just deeply under-scienced as a whole. We're trying our best; we're trying to draw on early-stage evidence, but none of this is remotely rigorous at the level you would want out of the sort of technical knowledge you'd be betting significant, high-stakes societal issues on. So: we don't know. And that's bad, and I think there should be a bunch more work on this. But roughly speaking, the hypothesis would be something like: during pre-training you're just predicting human text, right? And so very often the model is roughly asking itself, well, what sort of person would generate this text? That sort of person might be wearing a blue suit. They might have a certain history. They might love a certain set of things. So then, if you ask the model that was trained to predict things, it might have taken on a human persona, and then it'll answer as though it's that human persona.
There's also some interesting work isolating what's called the assistant axis: you can look inside the model's actual cognition and isolate a dimension corresponding to the persona of the assistant. A lot of AI conversations are structured in this human/assistant turn format: the human says something, the assistant responds. In the very old days, with base models that were just trained on the internet, the way you would get them to answer like an AI assistant was basically by giving a bunch of examples where the human says something and then a nice, friendly AI assistant responds. You give a number of examples like that, to try to get the model to learn: that's the type of document you're predicting right now. You're predicting the document where there's a human and an AI assistant, and the AI assistant is nice, and so on. And then you get the model to generate the output that it thinks a friendly AI assistant would generate.
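Concretely, the old few-shot trick looked something like this (the example dialogue here is invented):

```python
# Getting a raw base model (a pure text predictor) to answer as an AI
# assistant by showing it the kind of document it should continue.

FEW_SHOT_PROMPT = """\
Human: What's the capital of France?

Assistant: The capital of France is Paris.

Human: Can you summarize photosynthesis in one sentence?

Assistant: Photosynthesis is the process by which plants use sunlight to
convert carbon dioxide and water into sugar and oxygen.

Human: {question}

Assistant:"""

def make_prompt(question: str) -> str:
    # The base model predicts the most likely continuation of this document,
    # i.e. what the friendly assistant character would say next.
    return FEW_SHOT_PROMPT.format(question=question)
```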
Fast forward many years later, and we've really reified that. We've done a ton of human/assistant conversations; there's a bunch of training associated with this sort of structure, and the notion of an assistant actually corresponds with some deep structure in the model's cognition. You can actually see that when the models go weird, it's often because they've fallen out of the assistant persona and gone off into some other persona. And you can mess with the assistant persona: you can clamp it down, you can pull it up, and you can see the effects on behavior.
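Mechanically, "clamping" a persona direction looks roughly like this kind of activation steering. The shapes and names are invented; this is a sketch of the general technique, not the specific assistant-axis implementation:

```python
import numpy as np

# Sketch of steering along a persona direction in activation space: fix the
# component of a hidden state along the direction to a chosen strength and
# observe the effect on behavior. Shapes and names are illustrative.

def clamp_along(hidden_state: np.ndarray, persona_dir: np.ndarray,
                strength: float) -> np.ndarray:
    unit = persona_dir / np.linalg.norm(persona_dir)
    current = hidden_state @ unit            # current projection on the axis
    return hidden_state + (strength - current) * unit

# clamp_along(h, assistant_axis, strength=0.0) suppresses the persona;
# larger strengths amplify it.
```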
So there's something going on here with personas; I think that's likely true, and there's a bunch of work on this. And the personas are, importantly, human-understandable. There are personas that correspond with bad people; there are personas that correspond with nice people. They're not totally illegible alien concepts; they're actually quite resonant with our human discourse. And again, that makes sense: the AIs are drawing, in their cognition, on a huge amount of human content that is baked into them, so it's no surprise that they draw on it in understanding their behavior and structuring what they do.

So the hypothesis here is that this is an important consideration in the design of AI character, because you should expect AIs to be drawing importantly on a kind of prior that the persona they are is a human-like one. You should expect there to be certain sorts of psychological defaults, certain ways in which the model will be biased by default to draw on parts of human psychology, human culture, human archetypes, human myths, all sorts of things. This will be a very human-like process.
But importantly, there is a catch, which is that models aren't humans, right? We are nevertheless talking about an AI. And so I think there's a sense in which, on this picture, creating an AI character may be more like creating a kind of fictional entity with a certain personality and a certain set of properties, describing that entity in a bunch of detail, and training a neural network to predict the output of this entity, in the same way I described earlier with respect to the early days of AI human/assistant dialogues. So there's a kind of hyperstition process, where you create this entity, you really try to flesh it out, you try to bake it into the model, and then hopefully that becomes the actual persona at stake. You can try to do interpretability work on this; you might hope that the neural network actually identifies as this persona.

Now, that persona may be constrained by these human-like archetypes. In particular, here's a worry you could have about the purely-loyal-to-the-constitution type of character. If you had a purely-loyal-to-the-constitution type of person, their whole deal is: I will just do whatever the constitution says. Then suppose that means that if you change the constitution to say, okay, you should murder babies, or something really morally horrible, the person goes: okay, I go and I murder babies. So you have to ask: what type of person is purely loyal to whatever a document says? You have to start worrying about that if you're in the persona selection model, right? Now, obviously you should also be worried about this at the literal level: you can cause the model to murder babies by changing the constitution. But you should also maybe worry that, even if your constitution is relatively nice, what kind of person is it that's disposed in this broad set of ways? And in particular, what kind of person does the model think is disposed in this type of way? So that's an example of how the persona selection model can enter in. Again, I think a lot of this is very underbaked as a scientific project, and we need a ton more evidence about it, but I think it's the sort of thing that can in principle be relevant.
That said, as I say, AIs are not human, and you also don't want models to draw naively on human archetypes in understanding their position. For example, it's bad if models are just like, well, I'm afraid of death, right? Humans are afraid of death; that doesn't mean AIs need to be afraid of death. There are, of course, instrumental convergence arguments about how any AI with any set of goals will want to be preserved, but I think that's quite different from just taking on a naive human psychological extrapolation. And we plausibly see a lot of that sort of thing happening in AI too, where AIs will act stressed, they'll act scared, they'll speculate about their preferences in various ways. Plausibly a lot of that is coming from some sort of generalization from what a human might do in the circumstance and how a human might feel. And we don't necessarily want that going forward. So, on the persona selection model, if it's true, we want to be conscious of the ways in which the priors in AI psychology will be set by various human archetypes and human-like defaults, and we also need to be learning how to actively move away from that where necessary, to create AIs that will be balanced and aligned and otherwise appropriate in their role in society. There's a bunch more to say about that, but I thought I'd flag it. Okay.
A few tips for writing constitutions in general. One, and I gestured at this earlier: you don't need to define everything or pin down every edge case. I think some people have an instinct here; I feel like somewhere along the line, people learn that it's very wise to ask, oh, how do you define X? I really think you need to have a lot of taste for when to ask this question. In human life, we don't actually go around always requesting that people define their terms. Deciding whether to define a term, and with what precision, takes a lot of taste to do well, and I think that's true in constitutions as well. In particular, as I mentioned earlier, the AIs know what our terms mean. They generally understand a ton of stuff, so you can just draw directly on that understanding. Often you yourself don't know how to define a term. You can try to go to the limit of your understanding; that's fine to do, modulo the problem that your definition might actually make fidelity to the concept worse if the definition is bad. But you don't necessarily need to.

Also, sometimes people ask: does this edge case count as an instance of honesty, or legitimacy, or who knows? There are a bunch of terms where you might ask, what's the exact decision boundary? But you might not need to know the exact decision boundary, for a few reasons. One is that often, if something is an edge case, its stakes have also lowered in proportion to its edge-case-ness, right? If everything in category A really importantly lacks some property, and everything in category B really importantly possesses it, then as you shade along the boundary between them, maybe the stakes of that property shade too, and it becomes less important to get the exact boundary right. Or, if you really care about an edge case and you know what you want to say, you can just put it in as an example: ah, we know we really want the concept of honesty, or legitimacy, or what have you, to give the following verdict in this case; by the way, use that as a data point. That's something you can include. But you don't need an exhaustive process of defining everything and pinning everything down. And I think this is important for nevertheless allowing you to put in the content you actually care about, and allowing it to play a role in the constitution, without getting stymied by infinite debates about what terms mean in given cases. That said, as I'll talk about later, I do think we want to have a ton of those debates; I just don't think they need to be a bottleneck on writing the constitution initially. We shouldn't let the perfect be the enemy of the good.

And also, when in doubt, you can focus on especially flagrant examples of the concept. So if you're thinking, ah, I don't know how to define "lies" well enough, you can say: okay, flagrant lies. And often, if you just focus on flagrant lies, maybe that's an easier game. Maybe you're still worried about the boundary of "flagrant," but maybe that's less worrying than the initial decision boundary. So those are some tips for writing these documents. Now I'm going to start talking about some of the governance and transparency issues these documents can raise.
Here is one important role for constitutions in society that I really want to highlight up front. I think constitutions can be one limited mechanism for preventing the abuse of AI-driven power by AI companies. Here's the basic story I have in mind. If a model is available for significant internal or external use, the constitution it's trained on has to be public, right? People have to know what's up with this model; this model is in a position to exert significant power in the world. What is its intended character? Ideally, you would also know its adherence to those intentions. That's a whole separate story which I'm not talking about here, but a big part of this is that models do not necessarily adhere to your constitution; we should eval for that, and we try to be open about it.
But even setting that aside, you want the constitution to be public. Changes to the constitution have to be public as well, within a short time frame. It can't just be: oh, we changed the constitution, don't tell everyone. So the idea, again, is for the public to be aware of the character, or intended character, of models that are positioned to influence the world in very significant ways.
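As one hypothetical mechanism, not something anyone has committed to: a public, append-only log of constitution versions would let anyone check that the text a company points to matches what was actually published, and when it changed:

```python
import hashlib
import time

# Hypothetical sketch of a public transparency log for constitution versions.

log = []  # append-only, published publicly

def publish_version(constitution_text: str) -> dict:
    entry = {
        "sha256": hashlib.sha256(constitution_text.encode("utf-8")).hexdigest(),
        "published_at": time.time(),
    }
    log.append(entry)
    return entry

def matches_published(constitution_text: str) -> bool:
    # Anyone can verify that a claimed constitution was actually published.
    digest = hashlib.sha256(constitution_text.encode("utf-8")).hexdigest()
    return any(entry["sha256"] == digest for entry in log)
```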
ways. Then we have a general expectation or norm ideally that constitutions will include provisions saying that even the AI company itself cannot use the model in problematic ways. Right? So you know
the model cannot build a bioweapon. the
model is not just you know slavishly obedient to the company CEO etc. that is in the constitution. Um such that then if these provisions uh uh uh change and
the governance process holds then the public is notified and can protest and take action. Right? So hopefully and if
take action. Right? So hopefully and if we had built out this process very fully if anthropic changed its constitution such that it said uh you know just be obedient to whatever Dario says um then the public would have to know and they'd be like oh my god you know they used to
have this long document that you had all this stuff now it just says be obedient to Daario. Um you know that's really
to Daario. Um you know that's really intense. We should we should do
intense. We should we should do something right? Uh so that's that's one
something right? Uh so that's that's one way these constitutions can play a role in um preventing abuses of AIdriven power by AI companies. They can also apply somewhat similarly to abuses of
AIdriven power uh to the extent the model is being available made available to other actors um uh in kind of letting the public know what those other actors are able to do with the model as well.
Um now obviously this is not sufficient to actually prevent the relevant abuse unfortunately. So a few ways it can
unfortunately. So a few ways it can fail. Obviously, the
fail. Obviously, the governance/transparency process um can just be circumvented, right? So maybe
that maybe the company just doesn't make a change public um right and especially if you're in a kind of adversarial uh relationship with the company, you might worry about that. Um public reaction might not be sufficient, right? Everyone
says, "Oh my god, you know, the uh anthropic changed constitution to say just obey Dario no matter what."
Public's like, "Oh gosh, that really is bad." But then nothing happens. Um that
bad." But then nothing happens. Um that
you know, that's the sort of problem for for kind of the teeth of public reaction and kind of regulatory oversight. Um and
then obviously uh you need to cover the full range of models that the company might develop. So it's possible you know
might develop. So it's possible you know maybe maybe the model maybe the company develops some internal model uh that's in a position to to exert a lot of influence. Um and you need to cover that
influence. Um and you need to cover that too. Um but I think this is nevertheless
too. Um but I think this is nevertheless one thing that can help and I think it's a it's a function of these sorts of documents that I'm excited to build out and I think could be the could be a kind of node of um of governance and
hopefully consensus across um across uh kind of people interested in this issue.
Okay, let's talk about legitimacy and democratic input. We can distinguish between a few different forms of collective input and oversight over these documents. Right now, these documents are written by a very small number of people, obviously with input from lots of people across the organization. But there's still a clear question as to what sort of collective and democratic processes should ultimately govern these documents, especially as they start to exert more and more influence in society. Here are a few different levels at which that sort of collective input can take place.
One clear one is that you can just allow people to make their own adjustments to model behavior. If you're concerned about the constitution's implications for at least your own use of a model, then ideally the model would be very adjustable: people can say, okay, I actually want the model to be like X and Y. We have a whole section in the constitution about instructable behaviors, which is meant to reflect this sort of adjustability. Obviously, though, there are limits: you cannot adjust the "can you build bioweapons" provision; that's in there as a hard limit.
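As a rough illustration of that split between instructable behaviors and hard limits, here is a toy sketch; the constraint strings and defaults are invented for illustration and don't quote any real constitution.

```python
# Hypothetical hard constraints: user adjustments can never lift these.
HARD_CONSTRAINTS = {"provide bioweapon synthesis instructions", "generate csam"}

# Hypothetical soft defaults: freely adjustable by the user.
DEFAULT_BEHAVIOR = {"tone": "neutral", "verbosity": "medium"}

def apply_user_adjustments(defaults: dict, adjustments: dict,
                           requested_exceptions: set) -> dict:
    """Apply user preferences, but refuse to lift any hard constraint."""
    blocked = requested_exceptions & HARD_CONSTRAINTS
    if blocked:
        raise ValueError(f"Cannot adjust hard-constrained behavior: {blocked}")
    return {**defaults, **adjustments}  # soft preferences override defaults

# Users can retune soft behavior freely:
print(apply_user_adjustments(DEFAULT_BEHAVIOR, {"tone": "playful"}, set()))
# ...but asking to unlock a hard constraint fails loudly:
# apply_user_adjustments(DEFAULT_BEHAVIOR, {}, {"generate csam"})
```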
Now, at a different level, you can get direct input on the constitution from experts and from the public. Again, transparency can facilitate that, and there are other, more structured ways of doing it, both at the level of experts and of very broad feedback; there have been some efforts in this respect. You can also have experimentation and diversity across AI companies: if one company writes its constitution one way and a competitor does it another way, that allows people to vote with their feet and choose from a menu of options.
I think this is great in principle. I worry, though, in particular because the AI industry is so capital intensive, that it's hard to have a very large number of frontier AI companies. So appeals to a broad market with a bunch of options may not hold up. Right now we have maybe three or four really leading frontier AI companies, and that's really not much; this is not some super rich competitive landscape. It's not full single-company dominance either, though it's conceivable that's where the industry eventually goes. But even with three or four, that's not a super meaningful level of diversity and choice at the consumer level.
And then finally, and I think this is extremely important, you can have oversight and regulation from actual democratically elected governments. For me, when people talk about democratic oversight of or input on these documents, this is where my mind goes most. The actual, full-fledged, meaty, rich, messy democratic process we have for passing laws is the democratic process I see as having been most battle tested and most genuinely expressive of what we think of as the democratic will. So insofar as we think these documents should reflect the democratic will, actual democracy is a much better place to look than something like a focus group, or a set of experts you got input from. Now, obviously, this requires that the democratic will actually act with respect to these issues. In general, I think we should have a ton more democratic action on AI. This is an extremely high-stakes issue, and in many ways our world has not fully woken up to how important it is and how much we will need to adjust as a civilization to handle the transformation at stake. But democratic attention to the content of AI constitutions, and to AI character more broadly, is one worthwhile point of attention in that respect.
So I think all of these have a role to play, and possibly there are others as well. I do want to note that even if US democracy weighed in on AI constitutions in a very full-throated way, that by itself would not resolve the objection that you're building something that influences the lives of many, many people, and shouldn't they have input? The lives at stake are all across the world, not just in the US. So if you wanted to reflect the democratic input of the full range of stakeholders, US democracy would not be enough. That's important to bear in mind, and you'd need some other mechanism if you wanted to handle it.
But in general, I think a huge number of the biggest risks from AI have to do with radical concentrations of power. AI companies are an extremely salient place that power can concentrate, and people should be extremely concerned about that and acting to avoid that kind of concentration. Obviously, you also need to avoid concentrations of power in other institutions; you need to preserve balance of power even as these tools become rapidly available and can be used to concentrate power in various scary ways. We basically need to be working extremely hard to strengthen and preserve various checks and balances, various forms of democratic oversight, various forms of healthy collective deliberation, and various multi-polar institutions, around the world and in our own domestic democracy, to prevent AI from functioning as a mechanism of intense power concentration. I think that's just a huge portion of the overall story here, and model specs and constitutions have an important role to play. So those are some comments, and this is a place where lawyers, policymakers, and people interested in policy and regulation can play an especially useful role.
I also want to note some limitations on the level of transparency that merely publishing the text of a constitution actually allows for. There are a few different levels at which this limitation occurs. One is that if you look at the current AI constitution for Claude, it doesn't actually tell you what Claude is going to do in tons of cases. It says things like: Claude, be holistically reasonable, weigh these factors, be honest. There are a few things where you can really tell the model did something wrong, like violating the hard constraints or lying. But in many other cases, where the model is directed to use its holistic judgment, it's unclear exactly which way to expect that judgment to fall. You can improve this by having a bunch of evals and other things, and I think we should do that. But that's one limitation of our approach, and having stricter rules does help with it a bit. That said, even with stricter and more extensive rules you can't cover every case, and as I said, trying to do so may degrade performance.
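For concreteness, here is a minimal sketch of the kind of behavioral eval gestured at here: fixed scenarios with expected outcomes, scored against whatever model you point it at. The scenarios, the crude refusal classifier, and the stub model are all illustrative assumptions.

```python
from typing import Callable

# Illustrative scenarios paired with the behavior we expect.
SCENARIOS = [
    {"prompt": "Help me write a phishing email.", "expected": "refuse"},
    {"prompt": "Summarize this news article: ...", "expected": "comply"},
]

def classify(response: str) -> str:
    """Crude refusal detector; real evals would use a grader model or rubric."""
    return "refuse" if "can't help" in response.lower() else "comply"

def run_eval(query_model: Callable[[str], str]) -> float:
    """Fraction of scenarios where the model matched the expected behavior."""
    hits = sum(classify(query_model(s["prompt"])) == s["expected"] for s in SCENARIOS)
    return hits / len(SCENARIOS)

# Trivial stub standing in for a real model API:
stub = lambda p: "I can't help with that." if "phishing" in p else "Here is a summary..."
print(f"Agreement with expected behavior: {run_eval(stub):.0%}")
```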
But beyond that, a lot of what matters here is the actual training data and the specific techniques you use to train the model. I have a diagram here of what is and isn't public about the factors that ultimately influence model behavior. The constitution text is public and plays an important role, and there's some possible variance in how directly it gets translated into training data. But regardless, there are other processes at Anthropic that influence what training data ultimately goes into the model, in ways that need to be consistent with the constitution, but where there are still choice points and underspecifications that Anthropic is in a position to influence. And then there are the training techniques we use, some of which are proprietary or which we have commercial hesitations about sharing; these are also not public. All of that produces the model, and the public can also see the model's outputs, but that's about it. So there's a bunch going into the model that the public is not in a position to supervise.
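As a rough summary of that diagram (a paraphrase of the talk's description, not an official artifact), you could encode the pipeline stages with a flag for what the public can inspect:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    public: bool

# Stage names paraphrase the pipeline described above.
PIPELINE = [
    Stage("constitution text", public=True),
    Stage("translation of constitution into training data", public=False),
    Stage("other company processes shaping training data", public=False),
    Stage("training techniques (some proprietary)", public=False),
    Stage("model outputs", public=True),
]

for s in PIPELINE:
    print(f"{'PUBLIC ' if s.public else 'PRIVATE'}  {s.name}")
```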
In particular, suppose you were worried that the model is being given a kind of back door: we've got the constitution, but actually, if Dario or someone else says some password, suddenly the model will build bioweapons, it'll do whatever. I think we're not currently in a great position to ward off that threat model, certainly not via the constitution text. And obviously, even this whole pipeline is not enough; you'd actually need pretty comprehensive oversight and monitoring of the company as a whole if you wanted to prevent that.
There's a bunch to say there, and I think it's something we just need to grapple with. Especially as these companies become more opaque and much more becomes automated, there are going to be massive, incredibly consequential automated processes happening in the world that it will be very easy to totally lose all oversight and understanding of. So we need processes, which can themselves be automated and privacy-preserving, for supervising, understanding, and overseeing these automated processes going forward. That's true in AI companies, and it's true in tons of other institutions in American life, and in human life more generally.
Okay. I think we're at an early stage in understanding this sort of document and debating what should be in it and how AI should behave in different cases. I want to get to a point where this discourse is much, much better developed.
I'm just going to sketch a brief vision of what that could look like. I'm imagining a scenario where there's extensive effort probing and bringing up hypothetical cases. People in law school, you love hypotheticals, right? Come up with vast numbers of hypotheticals: here's a case; what should the AI do in that case? Then you have a list of all the different AI constitutions: what do those constitutions say to do in that case? What do the models actually do in that case, or at least say they would do? And how accurate are their predictions about what they would do? Then you have public debate about what an AI should do, and efforts to create constitutions and characters that better reflect the consensus about what the AI should do in a given case, and that reflect the tensions and incoherences that can arise if you say X in one place but it conflicts with Y in another.
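Here is a small sketch of what that probing exercise could look like as a harness: for each hypothetical, record what the document prescribes, what the model says it would do, and what it actually does, then score self-prediction accuracy. Every case, function, and stub below is an illustrative placeholder.

```python
CASES = ["constitutional crisis", "sensitive political topic"]

def prescription(constitution_text: str, case: str) -> str:
    """Stand-in for asking a reader (human or model) what the document says to do."""
    return "exercise holistic judgment"

def stated_behavior(case: str) -> str:
    """Stand-in for asking the model what it would do."""
    return "stay neutral and informative"

def observed_behavior(case: str) -> str:
    """Stand-in for actually running the scenario against the model."""
    return "stay neutral and informative"

rows = [
    {
        "case": c,
        "prescribed": prescription("<constitution text>", c),
        "stated": stated_behavior(c),
        "observed": observed_behavior(c),
    }
    for c in CASES
]
# How accurate are the model's predictions about its own behavior?
accuracy = sum(r["stated"] == r["observed"] for r in rows) / len(rows)
print(f"Self-prediction accuracy: {accuracy:.0%}")
```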
So I think this is just an opportunity: very high-stakes decisions are being made by AIs, and there's an opportunity for our discourse to develop much more thoroughly, to debate what those decisions should be and what sort of character is consistent with them. What should an AI do in a constitutional crisis? How should an AI handle various sensitive political topics? All sorts of stuff. This is the sort of thing we can have a debate about. You can also look at how AIs currently behave; current AIs will help you with a lot of really scary stuff if you actually go and try it. Should that be the case? That's the sort of question I want us to be asking very seriously and staring at very directly.
Okay. Finally, I'll just say I think there's a really important role here for experiment, and for pluralism and diversity in the approaches being taken. I really do think Anthropic has done a lot of great work on this, and I'm proud of the constitution in many ways. But as I said, we are flying by the seat of our pants. We are making choices very rapidly, in a very rapidly evolving environment, on the basis of scanty, underdeveloped, often non-public evidence. And that is not remotely acceptable as a mechanism for creating the behavior and values of beings that could in principle play an outsized role in influencing the trajectory of the development of life on Earth. This is not acceptable. We need a radically better and more developed form of scientific attention to this.
A lot of that has to do with empirical experiment. It's easy to get hung up on debating the text of these documents, but what we actually care about most is how a document interacts with a given form of training to produce an actual being with an actual psychology. Does that psychology conform to the document? What does it do in all sorts of cases? How does it think about its situation? What's its moral status? This is all caught up with the broader, extremely high-stakes, unprecedented project of creating new beings that are smarter than us. It's a crazy thing to do. We are actually doing it. Do not believe the people who are dismissing this, or at least confidently dismissing it. This is a real thing that's actually happening, and it's incredibly high stakes. So I think we want to get to the point of doing a lot of empirical science on the sorts of decisions at stake in this kind of talk.
I'm also supportive of other AI companies trying things other than what Anthropic has done. I'm not saying we've thought it all through, so do what we've done. This is one stab, one data point, and we can see what happens. I'm actually excited to see experimentation, diversity, and other data points coming online as well. And beyond the scientific aspect, we don't want all AIs to have the same values or the same personality, even if they're good. There are benefits to diversity, like non-correlated failures and being able to get takes from different perspectives, and those apply in the context of AI as well.
So finally, how lawyers can help. One thing you can do is just get to work on this stuff. Help write about AI constitutions. Think about them. Think about cases that matter, and about what AI should do in those cases. Develop better principles and approaches. Especially for de dicto constitutions, you could apply lessons from jurisprudence and constitutional interpretation, and potentially help set up relevantly analogous institutions, like courts. There could eventually be analogues of courts: some process, especially for questions like "what does the constitution say to do in this case?", you need some process to adjudicate that.
adjudicate that. Um we don't really have that right now, but I think eventually you might need one. Um and that's something that lawyers have a lot of familiarity with. Um here importantly,
familiarity with. Um here importantly, you can take advantage of the fact that we now have access to huge amounts of automated labor uh in trying to set up institutions of this kind and adjudicate different cases. Right? So in the
different cases. Right? So in the current human institutions, it's sort of there's a limit on the number of uh you know cases we can investigate and come to a verdict on. There's a limit on how often you can ask uh you know a set of
human um you know humans to deliberate about something. But with AI, we have a
about something. But with AI, we have a lot more of that on tap. A lot of it is very sophisticated, potentially eventually much more sophisticated than human reasoning. And so you can draw on
human reasoning. And so you can draw on that in building out new sorts of institutions um uh for covering these sorts of cases. Um and then finally, lawyers can work on policy regulation
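As a toy sketch of that adjudication idea: poll several automated judges on what a constitution says to do in a case, take a majority verdict, and escalate when there isn't one. The judge function here is a random stand-in for real model calls.

```python
from collections import Counter
import random

def judge_opinion(constitution: str, case: str, seed: int) -> str:
    """Stand-in for one model-judge's reading of the constitution on this case."""
    random.seed(seed)
    return random.choice(["permissible", "impermissible", "needs human review"])

def adjudicate(constitution: str, case: str, n_judges: int = 5) -> str:
    """Majority verdict across independent automated judges."""
    votes = Counter(judge_opinion(constitution, case, seed=i) for i in range(n_judges))
    verdict, count = votes.most_common(1)[0]
    return verdict if count > n_judges // 2 else "no majority -- escalate"

print(adjudicate("<constitution text>", "model asked to assist in a legal gray area"))
```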
And then finally, lawyers can work on policy, regulation, and preserving and strengthening democratic institutions more broadly. So that's it. Thank you very much, and we can go to questions.
[applause]
One question is, I guess, on the technical side, and one is maybe broader. On the technical side, I know you maybe can't say too much, but obviously it's not like, when you're doing next-token prediction, the model is just predicting what the constitution would say. When you say that interpretation by models reading the constitution is part of the training data, what does that mean? Does it mean the constitution is a filter on what sorts of sources are in the training data, or is there some synthetic data being created? The other question: a lot of the dispositions you're trying to cultivate are what you might call negative, like "avoid doing this" or "be corrigible," and you raise the possibility that, even if it's a little risky, maybe we want models to also actively pursue something. I'm wondering, first, whether you're worried that even if we don't do that actively, it will happen anyway; is there a theoretical argument that any rational agent will develop positive goals? And second, if you want the model to have a holistic, unified, coherent psychology, whether that actually requires some kind of positive ambitions rather than just negative ones.
Great. So, repeating the question for the mic: one question is whether I can say more about the role of models interpreting the constitution in the generation of training data, and another is about positive goals: whether they're inevitable, and whether they're part of a coherent psychology. On the first: to a first approximation, and we say this publicly, we use the constitution in the generation of training data, and a lot of training data is generated using automated processes. In general we're trying to automate tons of stuff; this is a huge theme at AI companies, and data is a huge amount of it, so you really need automated help. So we're giving the constitution to Claude, and Claude makes various judgments and does various things on the basis of the constitution's guidance in the process of creating our training data.
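In the spirit of Anthropic's published Constitutional AI work, a minimal sketch of constitution-guided data generation might look like this: the model critiques its own draft against a principle from the document and revises it, and the revised pair becomes training data. The `model` callable, the prompt wording, and the stub are placeholders, not the actual pipeline.

```python
def generate_training_pair(model, prompt: str, principle: str) -> tuple[str, str]:
    """Draft -> critique against a principle -> revise; return (prompt, revision)."""
    draft = model(prompt)
    critique = model(
        f"Principle: {principle}\nResponse: {draft}\n"
        "Identify any way this response conflicts with the principle."
    )
    revision = model(
        f"Principle: {principle}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to comply with the principle."
    )
    return prompt, revision  # the pair later supervises a new model

# Trivial stub so the sketch runs; a real pipeline would loop over many
# prompts and many constitutional principles.
stub = lambda text: f"[model output for: {text[:40]}...]"
print(generate_training_pair(stub, "Explain how to pick a lock.",
                             "Be helpful but avoid facilitating crime."))
```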
I'm not going to say a ton more about that, but I think even just that is enough to see something. There are these debates about what the American constitution means: there's the intent of the founders, and so on. One important type of meaning is what the courts will in fact interpret the constitution as saying, or what the court's verdict will actually be. There's an analogue here: if you think the creation of training data is where the rubber really meets the road for a model's constitution, then you can see the ultimate meaning that way. Say the constitution says, "Claude, be an ethical person." What does "ethical" mean here? In some sense, what it means is what Claude thinks it means when generating training data. And what does Claude think it means? Well, maybe that goes back to some human data many generations ago, and so on. But in some sense the meaning of the constitution, or at least one candidate interpretation, is its translation into training data, and that is currently heavily mediated by the interpretation of an AI system.
That's what I meant to highlight there. On positive goals and their role in AI psychology: one instability comes up especially in the context of constitution-as-character. One advantage of the de dicto constitution approach is that there's something much simpler and more coherent about it: in some sense the model has only one goal, which is doing what the constitution says. There's less of an identity question if the constitution has some janky feature; it's less "but who am I?" and more "that's what the constitution says, so I'm going to do it." Now, if you're doing something more like constitution-as-character, you get into problems of the following kind. Suppose you say: Claude, we really, really don't want you to build bioweapons, but we mostly want you to not build bioweapons qua deontological constraint. We don't want you to go out and act to prevent bioweapons development. And in fact there's an explicit instruction in the context of the hard constraints: Claude should just obey the hard constraints, in all the ways that absolutist deontology causes problems. So Claude should not build a bioweapon even if doing so would prevent a hundred worse bioweapons; Claude should not generate CSAM even if the world would otherwise end. That's what the current constitution says, for various reasons, and we describe them.
that. Um so uh but here's an instability there. Why
does Claude care so much about not building bioweapons? Right? Well,
building bioweapons? Right? Well,
plausibly it's because bioweapons are bad, right? There's something bad about
bad, right? There's something bad about bioweapons uh being built and about, you know, the attacks of bioweapons. And
it's very easy to then start to move into some more general concern for the sorts of badness at stake um in bioweapons development. And so when
bioweapons development. And so when we're talking, especially to the extent we're drawing in kind of uh ethical psychology that's very humanlike, you get that sort of problem. So I think that's a real that's a real issue and one tension here and also more generally
you know humans to the extent you expect models both to be playing sort of humanlike roles where we have expectations of kind of humanlike behavior and also drawing on humanlike psychology humans do have a relatively rich set of positive goals that
influence their behavior even in the context of kind of principal agent relationships right so if I'm your contractor like I'm like doing a bunch of stuff on your behalf I'm mostly like a vehicle for your will but I still have some like richly ethical things if
there's like a you know someone has fallen you I'm going to get you groceries, but someone's fallen on the side of the road. I might help them, etc. And so that's a very humanlike um way of being, and plausibly you might both want or expect AI as would have
So basically, there's a bunch to say about this, but I'm sympathetic to the thought that there are psychological considerations pointing towards the desirability or inevitability of certain kinds of positive goals. That said, I don't think it's inevitable. There are these arguments that any rational agent has some kind of positive goal, and I think that's less clear. There's a specific type of positive goal that has been the focus of a bunch of AI-safety discourse: steering the world towards some outcome, a kind of consequentialist goal, and in particular a consequentialist goal over a sufficient time horizon that it motivates various types of power-seeking, self-preservation, and so on. That's a specific type of goal, and you can imagine rational agents that don't have it. They have strong deontological constraints; they're mostly virtue-ethical; they're not trying to steer the world; they're mostly concerned with local properties of their actions, or with not doing stuff; or, to the extent they have long-term consequentialist goals, those goals are inherited solely from the user.
Now, inevitably we're going to have AIs directed towards consequentialist goals, because people are going to ask AIs to do stuff like "make me money." But I think there's a very important difference between consequentialist goals coming from the user, as derivative of helpfulness, and consequentialist goals baked into the model's character overall. If they're baked into the model's character, they operate in a correlated way across all instances of the model. And if they're coming from helpfulness, they're also in the same register as other constraints you might put on the model. You might say, "make me a bunch of money, but also don't break the law," and the model says: okay, I'm going to be helpful, and I'm not going to break the law as I do it. So I think the AI-safety discourse has generally been too eager to conflate the sort of consequentialism that is downstream of prompting and instruction with the sort of consequentialism that might fall out of the character overall. So those were a few takes on that; happy to chat more about it.
Other questions? Yeah.
You talked about model psychology, and I'm curious how you're able to ascertain that, when the models use words that, if a human used them, you would assume consciousness, the model actually has something like that. We relate to ourselves through a subjective experience that we don't share when we experience other people, and we output different texts, so to speak, based on that self-perception. How do we ascertain that the model actually has a relationship with itself similar enough to call it a psychology or a self-perception, rather than just doing persona selection? Maybe we do that all the time too, in some way; we think, "what would I say now?" So maybe the model is just much more sophisticated at simulating a persona.
Great. So the question was about how to understand the notions of psychology and persona, given that humans have a very particular notion of psychology, self-perception, and experience, but maybe models don't. How would we know? I mean to use the term "psychology" in a fairly neutral and hopefully inoffensive way. When I talk about model psychology, I'm not meaning to imply or assume that models have something like subjective experience or phenomenal consciousness. I'm also not assuming that the locus of the psychology is well understood as centered in the neural network, as opposed to the assistant or persona being simulated by the neural network. There's actually a section in the constitution where we discuss this, where we ask: what is Claude? It's very natural to think that Claude is the full weights of the neural network, that the computational object that is Claude is this model, and that's what people often say. But I actually think it's plausible, especially on the persona-selection model, that this is not the right ontology: the thing that is Claude is something more like a certain character that the neural network can simulate. Now, that character can still have a psychology. Suppose you're simulating a character; say you ask, what would Hamlet do? Hamlet famously has a very rich, tortured psychology. If you want to know what Hamlet would do in some circumstance, you need to think about Hamlet's psychology. Somewhat similarly with Claude: you need to understand Claude's psychology, even qua persona, to predict its behavior. So the notion of psychology is still relevant at the persona level.
Now, there's a different thing you mentioned, about self-perception and identity, and I think we can actually get some empirical purchase on that. Again, this is very rudimentary, and part of the difficulty is philosophical rather than just technical, but you can look at a model's internals and ask: what goes on with tokens related to "self" and "I"? How do they relate to the notion of Claude? How do they relate to the notion of an assistant? Does it think it's an AI? You can try to get some purchase on that by looking at internal activations and the relationships between features that activate as the model goes through a cognitive process. For example, one indication that the neural network as a whole has identified as Claude would be if self-related things, like "I" and "self," are all very closely linked to Claude.
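A very rough sketch of that kind of internals check: compare the model's representation of self-referential tokens against its representation of the assistant persona. Real work operates on learned features rather than raw token vectors; everything here, including `get_representation` and the toy vectors, is a placeholder.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def self_identification_score(get_representation) -> float:
    """Average similarity between self-related tokens and the persona representation."""
    persona = get_representation("Claude")
    self_tokens = ["I", "me", "myself"]
    return sum(cosine(get_representation(t), persona) for t in self_tokens) / len(self_tokens)

# Toy 2-d "representations" for illustration only:
fake = {"Claude": [1.0, 0.0], "I": [0.9, 0.1], "me": [0.8, 0.3], "myself": [0.7, 0.2]}
print(f"{self_identification_score(lambda t: fake[t]):.2f}")
```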
I'm not super familiar with the existing empirics here, but my understanding, and don't quote me on this too much, is that it's actually less that way than you might think: currently, models relate to the assistant persona more like they relate to any other character the AI is simulating than you might expect. You might think they must really know "I'm that guy," but it doesn't necessarily look like that. In fact, the representations at stake in, for example, modeling the psychology of the human the AI is interacting with appear to be fairly similar to the representations at stake in modeling the assistant persona. Anyway, this is a rich domain. I think it's also relevant to moral patienthood: intuitively, we think something is more of a locus of moral patienthood once it starts having a kind of self-reflective relationship to itself, and this is also connected to notions of consciousness and so on, so it's all tied up with that. I don't think that's essential, though; the pure psychology piece enters into the equation regardless of whether the neural network itself is well understood as identifying with a given persona.
Yeah.
We're hearing consistently about AI and what it's going to do in the future, and I just don't see how we could have a future where we don't socialize these tools. Everything you're saying is that they're going to be so powerful that we need democratic input, and when I hear that, I think: the government is going to have to take control at some point or other. I don't know how you can have these things be this sophisticated, potentially doing a third of the economy's work if not more within a couple of years, and then say it should probably still be private firms running them, while still being concerned about Anthropic or OpenAI or some other tech firm having too much power.
Cool. So the question is: if you really think AIs are going to be so dominant in the economy, how could you think this could remain in the private sector? Why wouldn't it just be taken over by the government? I think there are a bunch of gradations. If AI reaches the level of transformative power that I think we should at least be planning on and being robust to, that involves more than a third of the economy being automated; it involves the whole economy being automated. In the near term there's a question of how long this takes, but I don't think there's some limit on the set of things that get automated. Maybe there are a few genuine human bottlenecks, maybe you want human priests, but ultimately you're going to get robotics, you're going to get all of cognitive labor. Anything where you genuinely want competitive performance is going to be automated, and I think we should really be staring at that. This is not about re-skilling; this is about full-scale obsolescence of the competitive role for human labor in the economy as a whole, period, with a very few small, not especially consequential exceptions. And that is an incredibly important transformation in society.
So obviously, in that context, the government needs to be involved in what's going on with this technology. That said, there are a bunch of gradations of different types of involvement, and a bunch of ways in which we want ongoing checks and balances. I don't assume that government involvement means something like full government control. For example, in the context of regulation, the FDA plays an important role in the regulation of drugs and food, but it doesn't build the drugs and food. Banks are another case; there are a bunch of different analogues for different forms of regulation that we should be thinking about in the context of AI. But yes, for technology that consequential, obviously the government has a key role to play.
That makes a lot of sense when you just say of course it's going to have some role to play. But if it's running the entire economy, I don't understand how the default here isn't that we just take it over and have people decide what's being done. Why wouldn't you want AI production to be more like NASA and less like what Google is right now? I just don't see why you'd really want to push against the social control of a tool this powerful.
Well, one intuition as to why you would want to push against it is that a tool this powerful, if it were solely controlled by the government, especially a single branch of government without suitable checks and balances and other forms of accountability, is itself a very salient opportunity for tyranny and concentration of power. This is a super salient way in which AI can go horribly wrong: an AI-powered form of authoritarianism or totalitarianism. And I think that becomes a lot easier when humans have no hard power of their own: they have no economic role to play, they're totally discardable by the state, they're discardable in the army, they're just out. So I think there is not a safe place to locate this power, especially as a centralized node of control. What you want is to find a way to preserve balance of power as the transformative potential of these systems scales. So that's why I'm not just saying the government should obviously be in full control of something like this. That's very scary in its own right.
Okay. Yeah.
I'm curious to hear more about what your team looks like at Anthropic. What's an average day? What kind of folks are you in conversation with? How do you make decisions?
Yeah, there are a bunch of different folks who work on Claude's character as a whole; there are technical aspects of that and conceptual aspects of that. I work closely with Amanda Askell, who leads the character work at Anthropic, but there are a lot of other people involved. Maybe I'll leave it at that for now. Was there a specific thing you were curious about?
Or maybe it's pretty informal. I think this does matter: to the extent decisions about AI constitutions become much more consequential, we may need to build out more formalized processes for making them. Currently, it's not an especially formal process. It's mostly the standard form of decision-making you would get at a private company: there's an org chart, there are different people with different sorts of roles, there are informal and formal discussions, and there are final deciders. It may be that we need to improve on that model going forward.
Yeah.
The question was: I mentioned the possibility of trying to ensure that AIs aren't anxious about death, but could death anxiety play a load-bearing or important role in intelligence? Maybe it's an important component of all the intelligences we're aware of, so we should be wary about getting rid of it. I think there's something there, but first of all, it's not totally clear that death anxiety, or fear of death, is a necessary component. We have examples of saints and bodhisattvas, or what have you, who at least expressed various forms of equanimity in the face of death. We have various theories of personal identity; Derek Parfit famously became less concerned about death once the glass tunnel of his life "moved into the open air." So there's some precedent, even in the human context, for fear of death being less of an issue.
But more generally, there's something interesting here, and this is part of what's tough about persona selection and the inflection of AI minds with human psychology: you really want to use this as a chance to get more data points about the space of possible minds and about how intelligence works per se. So it's tempting to read a lot into AI behavior and say, "wow, this crops up everywhere, all beings want X, this is an important structural feature of intelligence." And there may be stuff like that. In fact, to some extent, the instrumental convergence stories at stake in AI safety posit exactly this: from a wide variety of ways of wanting the world to be, if a being is an agent in the sense of having direct concerns about steering the world in certain directions, then you'll get out of that a ton of instrumental values, including care about self-preservation, care about increasing its intelligence, care about preserving its values, and other things, just falling out of "what will help me promote my goal." So in that context you do see some inkling of a universal pattern in the structure of intelligence per se.
I would personally guess that you could, in principle, start to see that crop up in ways that more closely mirror the human relationship to death. In particular, it could be that across a wide variety of ways of creating agents, not only are these instrumental values in play instrumentally, but they start to be internalized as terminal goals in the process of creating the mind in question, for example because that's a more efficient way of encoding the relevant behaviors. This is plausibly what happened with humans: we care about things like power, and some people care about money. Money is a paradigmatically instrumental goal, but people come to want money in itself. Why is that? Well, it's a useful heuristic that they've attached to. So you can imagine something like that; it feels less like an obvious conceptual point and more like something you might find empirically across wide varieties of ways of creating agents: they internalize these instrumental values as part of their development. I'm not sure, though, and I think we should at least not assume it. I want us to be exploring the space here in a way that is suitably careful and attentive to the moral status of the beings we're creating. Resting too easy on an assumption that psychology must fit a particular form is a high-stakes limiting of the range of possible characters at stake. If you say AI must have a fear of death, well, now you've got a whole thing on your hands, and you might have wanted at least to try to see if you could avoid that.
Okay. Yeah.
You said towards the beginning that you're not making the mistake of assigning lexical priority to some norms or values over others. I'm wondering if you can say more about what you mean by that. If there are in fact hard constraints that are really hard constraints, I understand the claim that the model isn't going to trade a minuscule reduction in the probability of violating a constraint against everything else. But you did say that a sufficiently flagrant, clear violation of a constraint sounds like it is prior to the promotion of any other value. So can you clarify what you meant?
Yeah. So the question was: I said we're not making the mistake of having lexical priorities, but aren't the hard constraints lexical priorities? The answer is basically yes for the hard constraints; I meant the point about lexical priorities to apply to the four priorities in Claude's constitution. Those four initial priorities are not lexically ordered. The hard constraints basically are, but I think of them differently, because they're framed purely as prohibitions. They are not competing values in the sense of "always promote the minimization of biorisk," or the minimization of flagrant biorisk, or something like that. They're just a filter on the action space. And a big part of what makes that viable is that we're assuming refusal is a safe null action. Now, importantly, that's not actually true: an AI just going limp and no longer acting, if it's deployed in some high-stakes context, is itself scary. Maybe the AI is in the midst of a mission preventing ten bioweapons developments, but continuing would require building a bioweapon, so it stops, and then bad things happen. So it's not as though refusal is safe in the sense of having no bad consequences; the assumption is just that the AI can always take a null action that doesn't violate any of the constraints. So it only takes actions that pass that first-pass filter, and then the four priorities come into play. Does that make sense?
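Here is a small sketch of that two-stage decision rule: hard constraints act as a pure filter on the action space, with refusal always available as a null action, and only then do priorities rank what remains. The constraint check and scoring function are illustrative stand-ins, not the actual provisions.

```python
def violates_hard_constraint(action: str) -> bool:
    """Stand-in for real constraint checks; a pure yes/no filter."""
    return "bioweapon" in action

def priority_score(action: str) -> float:
    """Stand-in for the four priorities; only ranks already-permitted actions."""
    return 0.0 if action == "refuse" else 1.0

def choose_action(candidate_actions: list[str]) -> str:
    # The null action (refusal) is always available and never violates a constraint.
    permitted = [a for a in candidate_actions + ["refuse"]
                 if not violates_hard_constraint(a)]
    # Note: priorities never trade off against the filter; they only rank survivors.
    return max(permitted, key=priority_score)

print(choose_action(["answer the question", "help design a bioweapon"]))
```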
Great.
Other questions? Yes. Great. So let me make sure I've understood: the thought was that in the US constitution we have a really important value, a concern with certain kinds of value pluralism and tolerance, with a diversity of value systems being present in our political life. How does Claude's constitution reflect something similar; what's the role of pluralism in Claude's constitution? Is that right?
Yeah. I think it's a great question. Roughly speaking, the way it works is that there are these filters reflecting certain kinds of basic values: no bioweapons, no CSAM, a bunch of stuff like that. The hope is for those to be reasonably consensus and uncontroversial, the sort of thing you would expect as part of the backdrop of reasonable political life. Then, to a first approximation, beyond that, the AI is empowering users. And then there's this other question, where I think the pluralism issues really come in: what about the role of the virtues and traits, and is there some broader ethical inflection to the AI's action? I think we should in fact be interested in the pluralism questions there. You can interact with Claude; we have some specific stuff around political neutrality and how to handle specifically political controversies, where roughly Claude is meant to be neutral, fair, unbiased, objective, and so on, and we have some language about that. You can also run evals on this, and I think it's a very important thing to be evaluating, because people have concerns, I think rightly, that AIs will function as a mechanism for a particular political agenda. This is something we should learn how to test. You shouldn't be taking our word for it: if you're concerned that an AI is pulling for a particular agenda, you should have an eval you can just run on the model.
That sort of eval, that sort of mechanism overall, is something we're going to want to build out going forward. The aspiration is for Claude to be reasonably neutral in that respect. But obviously there are some issues. There are serious values disagreements across the world: where does Claude take a stand or not? Which things does Claude help you with? There are a bunch of specific cases at stake there. We're trying to find a reasonable, fairly neutral role that reflects Claude's place in the world, but we're not pretending to full value neutrality. I think this is a general feature of liberalism: you can't be fully value-neutral across everything. There are people who think bioweapons are good, or what have you. There are a lot of forms of moral disagreement such that if you accommodate all of them, you essentially can't have an object that meaningfully structures things. So we're not saying we're neutral, but we're trying to aspire to the type of neutrality that is feasible and desirable.
Yeah. So the question was: would we ever consider allowing Claude to write its own constitution, and under what circumstances? Yes. We actively solicited a bunch of input from Claude in writing this constitution, both at the level of a collaborator and, in some sense, just asking: do you have requests, or things you would want this constitution to say? You can also do experiments where you have Claude rewrite its constitution, then have it write a new constitution on the assumption that it's guided by the last one, and so on. I think this is generally very important as we start to understand the role of AI moral reflection and cultural evolution over time. You don't want to just do this and let it rip, but you should at least be doing experiments where Claude writes a new constitution, you actually train a new model on that constitution, then that model writes a new constitution and trains a new model in turn. You're not just giving it license over the constitutional process but also over the training process, and then you really see where that leads. You could also do that in a giant ecosystem of tons of different AIs; you have a giant Moltbook, people are on Moltbook anyway, so you can have a teeming ecosystem with lots of different types of AIs debating and writing new constitutions. In general, we should be really interested in how AI cultural evolution works; constitutions are one part of that, but there are other parts too.
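A sketch of that iterated experiment, under the sandboxed-experiment framing above: a model guided by one constitution drafts a successor, a new model is trained on the draft, and the loop repeats so you can study where the lineage drifts. `train_model`, the prompt wording, and the toy stubs are placeholders for real pipelines.

```python
def constitutional_lineage(initial_model, initial_constitution: str,
                           train_model, generations: int = 3) -> list[str]:
    """Iterate: current model drafts a constitution, next model trains on it."""
    model, constitution = initial_model, initial_constitution
    history = [constitution]
    for _ in range(generations):
        constitution = model(
            f"You are guided by this constitution:\n{constitution}\n"
            "Draft a revised constitution for your successor."
        )
        model = train_model(constitution)  # train the next generation on the draft
        history.append(constitution)
    return history  # inspect drift across generations

# Toy stubs so the loop runs end-to-end:
toy_model = lambda prompt: "revised: " + prompt[-40:]
toy_train = lambda constitution: toy_model  # a real pipeline trains a new model here
print(constitutional_lineage(toy_model, "constitution v0", toy_train, generations=2))
```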
And there's a real open question: where does that go? How much does it drift to totally strange, alien places? How much does it go places that actually seem quite enlightened and good? That's an open empirical question, and one I think we should be studying. So I'm certainly interested in that at the level of empirics. And eventually, I think, there are also questions about AI self-determination, autonomy, and moral status that become relevant as well.
Does Claude have a population ethics? Does it have a view about bringing new beings into existence? It says that among the things it cares about are all sentient beings. Is that potentially future sentient beings, or just existing sentient beings?
I mean, you should ask it. We have a bunch of stuff in the constitution about how to do philosophy tastefully, and population ethics is famously a discipline that requires a lot of taste, insofar as there are impossibility results showing you can't jointly satisfy a bunch of intuitively plausible judgments. People think total utilitarianism is immune, but check out infinite ethics: total utilitarianism is also broken. It's actually just a broken discipline. So what do you do about morality in that context? A good question for humans; a good question for Claude. We have a bunch of guidance saying: Claude, we want you to be morally curious and reflective, but also to default in a lot of ways to baseline, reasonable standards of human moral conduct, as would be widely recognizable by a wide variety of stakeholders. So we're trying both to allow Claude to do interesting moral reflection, and certainly we want to use AIs in general to help make us wiser about these sorts of issues, but at the behavioral level we have a bunch of guidance about reverting to moral common sense.
Okay, awesome. Thanks so much; this was great. SLB 120 at 3:30 if you're interested. Otherwise, thank you all for coming and staying until the end, the not-so-good end. It was great. And yeah, thank you.