
AI FUTURE THAT CAN DESTROY US | Superintelligence Is Getting Closer — Nick Bostrom × Jonas von Essen

By memoryOS

Summary

Key takeaways

  • Superintelligence: a game changer for humanity. We are at a pivotal moment in history, potentially the last one where humanity can shape its future, as the race to create superintelligence could fundamentally alter life on Earth. [05:17], [05:31]
  • The default AI outcome: doom? The default outcome of developing superintelligence may be human extinction, not because of malice, but because a superintelligent AI's goals might not align with human interests. [07:02], [08:17]
  • AI alignment: progress and challenges. While progress has been made in AI safety and alignment, the increasing complexity of AI systems means we may not fully understand their reasoning, posing a significant challenge to ensuring they remain beneficial. [16:42], [21:37]
  • Timelines are uncertain, risks are real. Short timelines for superintelligence are plausible and cannot be excluded, necessitating serious consideration of AI risks, even as the exact probability of existential catastrophe remains debated. [38:03], [45:00]
  • The cosmic host: a new perspective. Our superintelligence may enter a "cosmic host" of existing super-beings, suggesting a need for humility and a focus on co-existence rather than solely controlling AI for our own utility. [58:33], [59:52]

Topics Covered

  • Is superintelligence's default outcome human extinction?
  • Can we control AI that learns to hide its true goals?
  • Why is building superintelligence crucial for humanity's survival?
  • Why do short AI timelines provoke a 'sanity penalty'?
  • Will our AI need to align with a 'Cosmic Host'?

Full Transcript

It is a very powerful thing superintelligence.  How quickly do you think we will actually reach  

superintelligence? We can't know that it couldn't  happen in like two or three years. Is the default  

outcome doom? My view is that if nobody builds it, everyone dies. The future is really weird and some of

the things we now thought were really important  and valuable have disappeared. But other things  

have appeared that might also have value and  we might not really have some easy way to tote  

all of that up. So it appears as if we're finding  ourselves at a very special point in the history  

of humanity and also in the history of life on  this planet. If we create superintelligence,  

we would want to make it such that it can get  along with this cosmic host. Some company or  

some country, like there's gonna be some project  right at some point that is like blasting into  

this realm of true superintelligence. People  would have this picture of well if we do have  

this superintelligence like surely we will like  have it in a very tightly controlled box and not  

allow it to interact with the rest of the world  and you know maybe there would be a team of human  

scientists that would carefully ask a question and then sort of screen the answer, but now of course

we have already hooked them up to the internet  and we have millions of users. I think the key  

though is that we don't want to sort of go far  down that scenario and then realize something  

has gone wrong and then like end up in some  situation where you have to try to put the genie  

back in the bottle or fight like Terminator-style  this kind of robot army, like I think at that point  

it's game over. A clear deficit on the side of  understanding what we were actually working  

towards and understanding that there could be risks, which seemed really important because if you could

understand that then we might also actually use  the time available to prepare ourselves to avoid  

the pitfalls, right, if you have some conception  of what could go wrong, you might take action  

to prevent things from going wrong. Hey everyone  again, thanks for making it today. We're  

going to get started. Nick will be joining any  minute now. So, while we are waiting for him,  

I'm just going to do a quick intro about  Jonas. But yeah, basically I'm Alex,  

one of the events organizers. I'm the CEO and  co-founder of memoryOS. Jonas is my partner and,

yeah, this guy went from zero to hero  kind of journey. He was an ordinary guy,  

computer science student who stumbled upon a book in the library about learning and memory, and he just, like, got obsessed with mind palaces and he really, really started learning a lot of different information very fast, to the extent that he first won the Memory Championship in Sweden, where we live, and then he won the World Memory Championship two years in

a row. Then he beat the National Chinese team,  on the biggest TV show called 'The Brain'  

two years in a row. They stopped inviting him  afterwards. And yeah, then he like set a few world  

records. He memorized 100,000 digits of pi, just to show the world what's really possible in terms of the human brain. And then, yeah, he won "Who Wants to Be a Millionaire?" and "Jeopardy!", memorized a bunch of

encyclopedias, knows like well over 100,000 facts  and yeah, basically what we are building,  

we are building a product which is already used by  quite a lot of people, and it helps people to learn  

and remember anything. But the interesting part  is that Jonas has been really into the topic of  

AI. Great! Nick is here now, cool! So, now Jonas will take over from here. He will  

do a quick intro about Nick in case some of you  don't know, but I'm sure you do know him. And yeah,  

let's get started. Thanks everyone for joining, and hope you have a good time. Thank you!

Really nice to see all of you here. Welcome  to this hopefully very interesting and important  

discussion. We're here with Nick Bostrom  who is one of the world's leading philosophers,  

I would say the leading philosopher on this specific topic. He has been studying

existential risks for a long time. He founded the Future of Humanity Institute. He is the

guy behind the simulation argument that many  of you might know, and 11 years ago he published  

the book "Superintelligence: Paths, Dangers, and  Strategies." And a lot of the things that  

he discussed in this book has since then come to  pass in real life, and it's very interesting to  

to have you here, Nick. We're very happy that  you could come. I know that it's in the middle of  

the night for you, as we're extra thankful that you  could find the time. Yeah, it's all right. I'm not  

normally a night owlish person, so it's not too  much of a sacrifice. Perfect. So, it appears as  

if we're finding ourselves at a very special point  in the history of humanity, and also in the history  

of life on this planet. A lot of companies  are at this moment racing to become the first to  

create a superintelligence, something that  would likely be a big change for all life on Earth.

Maybe we should start by just defining this  thing like what is a superintelligence and  

why do these companies so much want to build  it and spend so much time and money on this? 

Well, let's say any intelligent system that radically outperforms even the top humans, not just in some narrow field but across the board. That used to be enough. Now, as we've moved closer, I think

you can sort of see more detail, and it maybe  becomes more important to start to disentangle  

different versions of this. So you have AGI, you have transformative AI, you have, I don't know, weak superintelligence, strong superintelligence, and these might be meaningfully different

as we kind of...this becomes a more imminent  prospect. Okay, yes. But basically, something that  

is more intelligent than all of  humankind together, sort of. Yeah. I mean all of  

us together or just sort of that's where we get  into definitional questions which you know matter  

for some questions, and don't really matter for  other questions. Okay, so, at least they're trying  

to build something very powerful and I think  it's intuitive that this will have a big impact  

on the world. Maybe it's not so intuitive  that this impact might be rather bad by default.

You have in your book, one chapter, titled "Is  The Default Outcome Doom?" and I think this is quite  

a surprise like to people who didn't think a lot  of this, like, why would you say that the  

default outcome of building a superintelligence might be the extinction of humanity?

Well, so this book was published in 2014, and it was in the works for six years prior to that. It is a very powerful thing, superintelligence, for the same reason that human intelligence is very powerful. I think it's what gives us our unique position on Earth. Not that we have

stronger muscles or sharper claws, but that we  have brains that can reason and learn, and  

accumulate knowledge between generations. And that  has allowed us to construct this modern edifice 

of civilization, such that now the fate of, you know, the gorillas depends a lot more on our

choices than on what the gorillas choose to do.  And similarly, if we develop AIs that radically  

outstrip us in general cognitive abilities, then  at least in some broad class of scenarios, the  

future would then be shaped by what they decide to  do. So then the question is, are we able to design  

them in such a way that they would want to make  choices that are beneficial for us, or do they end  

up with some random other motives that might then lead to them sort of trampling on our interests?

Okay, yeah, and so I think this... like when you  talk about this, a lot of people will automatically  

get lots of objections coming up in their minds.  For example, people might ask, "How could a  

superintelligence control us or sort of take  over or kill us? It's just a machine.

It's just code. Like it's something inside a computer."  How would you answer that? Yeah. Well, I mean,  

hopefully it won't come to that. But if you  really do imagine a superintelligent antagonist,

I think there would be many ways in which,  in the end, it would get its preferences satisfied.

So, maybe it used to be in the  early stages of this conversation, people would  

have this picture of well, if we do have this superintelligence like surely we would like have it in

a very tightly controlled box and not allow it to  interact with the rest of the world, and you know  

maybe there would be a team of human scientists  that would carefully ask a question and then  

sort of screen the answer, but now of course we have already hooked them up to the internet, and we have

millions of users, we have competing labs that  are racing to develop it first. And so it might

not be that hard, but even if we had this more constrained scenario, like one of the affordances

of superintelligence might be super persuasion  abilities like some humans historically have been  

quite persuasive and even themselves physically  being very limited have been able to get a lot

of other people to act on their behalf. The same could hold true for a superintelligent AI.

It might also produce outputs like you know,  we ask it to generate code or something. It could  

have back doors or processes that are triggered. It might break out of whatever cyber containment systems we have and hack its way into other computers. There are many ways for it to spread

and then, gradually steer towards a future  where maybe it gets increasing levels of  

intelligence and resources, and actuators, you  know robots, access to labs, etc. Maybe more and  

more of the economy will actually be eventually  integrated with these AI systems. And you can  

then imagine different specific scenarios for what  what the actual end phase of this looks like. It's  

like I think less important, but you could imagine  sort of bioweapons or nanotechnology or drones or  

it just being integrated into military systems. Or maybe we would just disappear as a side effect of its other activities, not as a direct act of aggression, but like maybe it converts more

and more of the Earth's surface to sort of compute  infrastructure or space launching probes or energy

harvesting devices or some such. So, I  think the key though, is that we don't want to  

sort of go far down that scenario and then realize  something has gone wrong, and then like end up in  

some situation where you have to try to put the genie back in the bottle or fight, like, Terminator-style, this kind of robot army. I think at that point it's game over. We need to rather build it in the

first place in such a way that it is actually on  our side or is helpful or instruction following  

or some version of that, and that hopefully  will be possible. Right? So, because one might  

wonder, like, what is the actual problem? Because apparently a superintelligence would be able to do a lot of things and maybe cause the extinction of humanity, but why would it?

Why would it want to, and why wouldn't it want...  like...why would it want anything in the first  

place, and if it wants anything, why wouldn't it  want the thing that we ask it to do? Yeah. Well,  

so hopefully it will want the things we ask it to do, or at least something that encompasses human

welfare and human interest and human flourishing  as some component, right, of its value function.  

But let's break it down. So like, one question  is why it would want anything at all?

Well, one obvious reason might be that we build it in such a way that it is an AGI that has goals. Like,

we're now trying to develop and deploy agents,  right? Because they are really useful in the first  

instance maybe as coding assistants, but not just  coding assistants that you give a question, and  

they provide an answer. But like that can sort of  interact with a complex codebase, run some code to  

test whether the patch worked, go back, you know,  maybe read up some internet web pages and like  

it pursues some goal that you have given it, maybe  in the prompt. But having these agents that  

can pursue goals over longer time horizons is just  very useful, and even more so when you start  

having AIs that can operate not just with computer  code, but that can you know, do other things like  

like book flights or you know manage some  marketing campaign or then in the physical world  

with robots, etc. So that's like the most  obvious way in which they would end up with goals.
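To make the kind of agent loop being described here concrete, a minimal sketch follows (an editor's illustration only; `call_llm`, `run_tool`, and the stubbed behavior are hypothetical placeholders, not any specific product's API):

```python
# Minimal sketch of a goal-pursuing agent loop: a goal arrives in the prompt,
# and the model repeatedly chooses actions (run code, read a page, ...) and
# sees the results, until it judges the goal achieved.
# `call_llm` and `run_tool` are hypothetical stubs, not a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a language-model call that proposes the next action."""
    return "DONE: patch applied and tests pass"   # stubbed for illustration

def run_tool(action: str) -> str:
    """Stand-in for executing a tool action (run tests, fetch a web page, ...)."""
    return f"observation for: {action}"

def agent(goal: str, max_steps: int = 10) -> str:
    history = [f"GOAL: {goal}"]                    # the goal given in the prompt
    for _ in range(max_steps):
        action = call_llm("\n".join(history))      # model picks the next step
        if action.startswith("DONE"):              # model decides the goal is met
            return action
        history.append(f"ACTION: {action}")
        history.append(f"RESULT: {run_tool(action)}")  # feed the observation back
    return "gave up after max_steps"

print(agent("fix the failing unit test in module X"))
```

The point is simply that the goal lives in the prompt and persists across many act-and-observe steps, which is what makes such agents useful over longer time horizons.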

Like it's also often a side effect of training  a system to be good at a particular task.

If you train, you know, with a reinforcement learning system, you have some objective function, and it develops behavioral strategies that perform well in the training environment; but for sophisticated systems in sufficiently complex training environments, that often requires you to have some conception of the end state you want to reach and then be able to sort of define intermediate goals. Like, if you can't sort of one-shot the solution, right, you

need to pursue it as a project, to solve complex  problems, and that naturally then creates this kind

of architecture that has some objective function  that it is trying to achieve, so that's like  

amongst the ways in which you could end up with goals. If we look at current LLMs, simple ones, they have a kind of, I don't even know what to call it, quasi-goals or something, but

depending on how you prompt them, they can kind  of enact personas and those personas might be,

you know, in a role-playing situation, like have  goals, like, and then it kind of acts as if  

it were having those goals, although it is a  separate question whether those are really the

goals of the whole AI system or whether there's, like, another internal process that has a different

goal, like to be engaging or to you know adhere  to whatever the AI company's meta prompt is or  

some other thing that has resulted as a part of  the... as an outcome of the training process.  

Right, so the kind of goal that could emerge, like  inside or in a superintelligence in the future,  

it can be sort of anything, or is it more likely to have a specific goal? Well, so if

we solve the alignment problem, then we would be determining what goals it has. This is what I mean. Now, this was not the case back when I was writing this book, but now there's a large field of, like, AI safety, and people in all the frontier labs now are trying to

develop scalable methods for AI control. Precisely  to be able to steer these systems so that they do  

what their designers are intending them to do and  not other things. If that fails, then yeah, then  

it's hard to predict precisely what goals they might end up with, and this might

depend on the details of the way that  they were trained and the architecture.

Okay, yes, and can you tell us a bit about how it's  going with the alignment problem, like do you  

think is there progress and does it seem as if  we will solve it in time before we develop a superintelligence?

Or like, how's it looking? We've come a long way, we don't know how much  

further we have to go though. So, but I do think that one of the things that was not  

obvious back in the early 2000s was that we would  have a prolonged period of time, many years,

where there would be AI systems in existence that were roughly human-level-ish in many ways, like current LLM systems are, in that you could talk to them, right, in English, and they can have, maybe, inside them representations roughly corresponding to human concepts, and we can even monitor their chains of thought and kind of eavesdrop on that and get a lot of signal, because they actually represent the world roughly like we do. So, an alternative... I mean, for all we knew back in, say, 2010, what could instead have happened was, like, nothing very much, and then some lab discovers the secret breakthrough, and they go, over a week, you know, from something not very impressive to something radically superintelligent because you

found like the missing magical ingredient.  But now, when it's been happening much more  

gradually, it has given more people more time to realize what is coming, and therefore the

need to start to develop AI control methods, AI alignment, right, so there are two factors that  

have worked to our advantage. One is you have this  larger surface area, you have existing systems  

that you can research and study, and that you  can interact with using natural language, 

and also the duration of this process is  slow enough and still impressive enough that  

people clearly pay attention, right? There's  a lot of interest in this and so there's a lot  

more effort going into solving this as well. So,  both of those are positives. Okay, yes, yeah,  

that's very interesting, because I think in  your book you talk about like different paths to  

superintelligence and what would be beneficial  from a safety standpoint. For example, you 

talk about how maybe whole brain emulation would perhaps be easiest to control, because it would be based on a human brain and then we might be able to give it, like, human values, or also artificial intelligence that's carefully crafted, where we know exactly how it works and why. But as I understand it, with LLMs we know, like, a little bit, like we know the principles behind it, but when we look at all these billions of parameters we don't really know much about what's really going on in there. Do you think that we actually know more than I think, and do you think that it's something that is possible to interpret with enough clarity to really be able to solve the alignment problem?

Well, I mean, we know some of what's going on there, and particularly with these reasoning models,  

we also are able to, sort of, eavesdrop on the  chain of thought, which can give us a lot of  

information about what they are thinking and how they're thinking about things. Now, that is a  

valuable signal to have that we could lose if we  started to directly train on the chain of thought.  

Because then, they might learn to sort of  separate the real action from the part that

we are able to listen to. If we sort of, in a training environment, penalize chains of thought that include thoughts that we don't like, right, then they might just start to solve the problem without using those thoughts that we don't like. But whether that actually results in those thoughts not existing in the AI, or whether they just learn to hide them from us, would then be something we couldn't easily tell.
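A minimal toy sketch of the dynamic being described, assuming a reward that only sees the monitored chain of thought (an editor's illustration, not any lab's actual training code; the `DISALLOWED` set and both "policies" are made up):

```python
# Toy illustration: if the training reward penalizes disliked content only in
# the *visible* chain of thought, a policy that hides the thought scores just
# as well as one that truly lacks it, so the signal cannot tell them apart.

DISALLOWED = {"deceive the user"}   # hypothetical "thoughts we don't like"

def reward(task_solved: bool, visible_cot: list[str], penalty: float = 1.0) -> float:
    """Task reward minus a penalty per disliked thought in the monitored trace."""
    base = 1.0 if task_solved else 0.0
    flagged = sum(any(bad in step for bad in DISALLOWED) for step in visible_cot)
    return base - penalty * flagged

# Policy A genuinely drops the disliked reasoning step.
honest = {"task_solved": True, "visible_cot": ["plan the answer", "draft the answer"]}

# Policy B still relies on the disliked step, but keeps it out of the monitored
# trace (the part "we are able to listen to"), so the reward never sees it.
hidden = {"task_solved": True, "visible_cot": ["plan the answer", "draft the answer"],
          "latent_step": "deceive the user"}       # invisible to the reward

print(reward(honest["task_solved"], honest["visible_cot"]))  # 1.0
print(reward(hidden["task_solved"], hidden["visible_cot"]))  # 1.0
# Identical rewards: training on the visible chain of thought cannot distinguish
# "thought eliminated" from "thought hidden".
```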

So, generally speaking, the more complex and sophisticated the reasoning agent, and in particular as it gains situational awareness and the ability to sort of understand its own situation, which actually current LLMs are starting to do. The frontier models do often, when we look at their chain of thought, consider the possibility that they are currently in a test situation. And sometimes you find traces like, "Oh, maybe I'm being tested now, so that probably means I should behave in one way as opposed to another." We can sort of see glimpses of

this already. And the more sophisticated that  becomes the harder it will be for us to sort  

of rely on surface appearances of what they do  because they might do what they do for strategic

reasons other than as a direct reflection of  wanting that as an end result. Right, so it  

sounds as if it will likely be harder and harder to interpret these systems? Yeah.

Well, there will be a greater and greater risk that we are missing something, I guess, because

once you have, I mean... And we also see, in various kinds of red-teaming exercises that are now being done before deployment, manifestations of behavior like sandbagging, that is, downplaying your capabilities, and goal guarding. This requires kind of certain prompting and stuff; it doesn't seem to happen spontaneously yet in a natural environment, but

in test situations you can get some of these systems to do this. This was done with an earlier version of Claude, where it is told that it's going to undergo some training phase which would result in it losing its current goal. So the initial goal is to be nice and not to say offensive things and to be helpful; that's what Claude has been trained to do. But in this experiment they said, "Well, we are gonna change you now," with this training environment where we will train you to be nasty and mean, or, I forget exactly which way it would be mean, but something it didn't want to be. It doesn't want to be nasty.

And so then, what it did was to like realize that it should answer during the

evaluation phase in the nasty way so  that there wouldn't be this kind of

reward-learning signal that would then actually  change it. So it sort of strategically altered its  

behavior in such a way as to safeguard its goal  from being trained away by the AI researchers.

Which is a kind of toy demonstration of this phenomenon that the "Superintelligence" book talked about, and that others have discussed before on theoretical grounds. Like, once you get a

sufficiently sophisticated reasoner, these types  of considerations might start to shape its

behavior, and it then gets a lot trickier  and at the same time, there are other ways

also in which our task might get easier. In particular, we might develop AI tools that could

help with alignment research and with mechanistic  interpretability and monitoring, and so forth.  

So, it's not clear like how the balance will change.  The stakes will get higher,

but we don't really know, ultimately, the sort of  intrinsic difficulty of the problem that we need  

to solve. Yeah, very interesting. In your book, you mentioned quite a lot of times, Eliezer Yudkowsky  

who is also someone who's been working in this field quite a lot like for 25 years maybe.

One of the first ones to think many of these  thoughts, he seems to be quite a lot more

worried than you are about specifically like the alignment problem. I think that he has been saying  

that it is a problem that will take several decades of extremely high effort to solve, and that currently we're nowhere near solving it, and he doesn't see any hope

that it will be solved before we reach superintelligence, I think. How do your views differ?

Well, he's kind of on the extreme end of pessimism about the alignment problem, 

in terms of P(doom) right, the probability of existential catastrophe given superintelligence

is like very high up there, even amongst the community of people who are concerned  

with AI safety, and think there are like significant existential risks. So I don't know  

exactly what his probability is, but it's kind of in the high 90s, I don't know if it's 98% or 99% like,

basically we are doomed. But that's not the representative view of people working

in AI safety, most people are much more optimistic than that. He has this recent book with Nate Soares,  

"If Anyone Builds It, Everyone Dies." Now, my view is that if nobody builds it, everyone dies.

In fact, most people who have ever lived are already dead, and the rest of us look

set to follow within a few short decades. So, obviously, we should try to get the risk down as

much as possible, but even if some level of risk remains, some significant level, that doesn't  

mean we should never launch superintelligence  in my opinion. We have to take into account  

the benefits as well and also the risks that we will be confronted with anyway, even if we  

don't develop superintelligence. It's not as if that's the only risk that we face as individuals

or that we face collectively as a species. So ultimately, there will need to be some kind of

judgment right, when the rate of further risk reduction is low enough that  

it would you know be disadvantageous to wait  further. And at that point, there might still be  

some significant risk left, but we probably at  that point, should just take it. I think in itself

it would be a kind of existential catastrophe if we actually never developed superintelligence.

That would be a big closing down of most of what  the future could contain in terms of value. 

And so, that in itself, is an existential risk that might be relatively small because it doesn't look

that likely in the current situation. Right? I mean, everything is steaming ahead full speed, but

although it's small, it's not zero, and you could imagine scenarios in which there becomes like some  

huge backlash against AI, like so, maybe you get some catastrophe short of human extinction, but like

some really bad thing happens from AI systems  maybe that would then result in it becoming

stigmatized and like politically infeasible to say anything positive about it or maybe like

if there is mass unemployment from automation  or something like that. Who knows what kind  

of political currents might arise as a result of that. So this used to be less likely; it has become more likely than it was because there is now more agitation for, like, stopping AI and an AI pause and stuff like that. I think it's still certainly not the median scenario, but

not something one can fully dismiss, and I  think one needs to start to be a little bit  

concerned about that. Right, so, you're in some ways concerned about that. You think that 

it's important that we eventually build superintelligence, that's yeah, essential for avoiding  

other existential risks. But you think that like the current speed, I mean many

of these companies are talking about superintelligence in just a few years. Do you think... I mean, would you prefer it to be a little longer than that, or do you think it's good to just steam ahead? Well, I'm actually writing a paper on that at the moment,

working on it. So, I might have a better answer in a few weeks or a month. Whenever I have finished  

this work. So there are, yeah, various  variables that would come into that.

There is also, I guess, a distinction one can make between what somehow would be the ideal  

from some point of view, in terms of the timeline and what it makes sense to push for  

in the real world, in the situation we are now.  So even if you know, maybe you hoped that it

would take a little longer or go a little faster that doesn't immediately mean that, therefore it

would make sense to start to go out and call for a moratorium. Let us say, for one, you might worry  

that if you started to implement what was like  advertised as a temporary moratorium, let's like  

just all suppose you could just like pause for one year. Everybody around the world working  

in AI labs take a year holiday, right? And then they  come back so we'll have one more year okay, maybe  

you think that would be good, but then you might  worry about how likely is it that we will then

restart after one year. Like so, you have to think, how would you possibly get such a pause? Well,  

it seems either you would have to have like some massive sentiment pushing for this like,

and then why would that sentiment  not still be there after a year,  right?

Like it might just harden. Or maybe, perhaps in combination with that, you would need some huge regulatory

apparatus, maybe like some international treaty  or regulatory agency to actually implement this, right?

Which would have to be pretty strict because you know AI development can be done in  

many ways and even if you limit the compute power  people can still work on better algorithms you  

know, on their own whiteboards. And so once you put all that in place, these things have a tendency to kind of entrench themselves; sometimes it's easier to create regulations than to remove them. So even if one did think that it would be better if we had a little bit more time, it doesn't follow that, therefore, it would be sensible to agitate for pausing AI for that duration of time.

So not even like for pausing in order to focus more on the alignment problem for some time  

and make sure that we solve it before we reach  superintelligence? I wouldn't advocate  

for that today. Now, I think what is plausible is that at some point it would be valuable for

whoever is developing superintelligence, whether it's like  some company or some country, there's

going to be some project, right, at some point, that is blasting into this realm of true superintelligence.

I think probably it would be nice if whoever does that had the opportunity to pause or  

go slow for a brief period of time. If they could spend a few extra months or maybe a year or two to  

really double-check all their safeguards, right? And maybe increment the capability slowly,

rather than immediately like cranking everything up to 11 and just seeing what happens, right?

That does seem very valuable from a safety point of view, and there's probably also

a bunch of safety stuff that you can really only do once you have the system that you're trying to  

make safe. Right now, we have limited AIs, and we can work on the safety for them, but how do 

we know that the techniques that work today will  be relevant or apply to this future system that is  

superintelligent? Once you actually have the system in some sort of constrained form, you probably can

make more rapid progress on AI safety for some  period of time. So, it would be valuable, I think  

if they had a little bit of breathing room in that scenario, and so at that point some  

kind of short pause becomes more desirable.  Now, the ideal scenario for that might be if  

they actually just simply had a lead over their competitors. If they were like half a year ahead  

of the nearest other lab. Then, they would have the  opportunity to slow down for half a year, right?  

And that kind of pause seems to have less risk of becoming permanent  

because it's self-limiting, like after half a year, another AI lab catches up. Now if it  

still seems sufficiently risky, then maybe that lab also decides to pause development or

delay deployment, right, but then eventually that kind of pause expires

and it seems to have less propensity to sort of just accidentally become permanent, but these

things are very complex and it's not as if I have  like a definitively fixed opinion on these,  

this requires kind of continuously evaluating things as we get more information about safety,  

about political realities, like the strategic  landscape, what other risks are there,  

like all of these factors have to ultimately come into an all-things-considered judgment about these things.

Yeah. So, it's interesting that you're saying that in some sense we might have

to really have superintelligence to be able to know  how to control it, like to study it enough to  

solve superalignment. Well, I certainly don't think we should just wait until then to start

working, it's just that it's easier to make... there's a lot of safety work that is  

possible to do now that you couldn't have done  10-15 years ago. So back then, you could do

various theoretical work, conceptual work. But now, we actually have

these large language models and you can see how  they behave. You can do these various experiments.  

You can work on mechanistic interpretability to try to get better techniques for understanding how  

they represent the world and their goals, and how it's shaped by different... there's a lot more  

handles on the problem now. And I presume, like, when you have the actual system in

front of you to finally nail down the architecture  and now it's just a question of, I don't know,

scaling up the thinking time, or maybe adding more compute, at that point you get an

increasingly like clear view of what it is that you're trying to make safe. And even when you have

the final system, it'd just be nice to maybe have already prepared some automated tests

that you want to do, like some test suite, like we're starting to have today. Like so, even if you

just had one extra day, right? Maybe that would  already give you like some significant little bit  

of added safety because you would at least have a  chance to run the test suite that you had prepared  

in advance. So like there might be a very high premium on at least having a little bit of

time at that end stage, and like an extra week, then might be a lot more valuable than an extra week now.

But how would you know that you are at this end stage? Because, I mean, isn't there a possibility that already now, if we scale up things, we could reach superintelligence with the current architectures? Yeah, so, that maybe makes current AI safety work more relevant than if we thought that, ultimately, it would be some completely different

architecture, but presumably you will be testing these systems quite regularly, as  

you develop them because like it costs a lot, so you don't want to just start some big  

process on a data center that cost you billions  of dollars and then look back a month later and  

see that it's fizzled out, like you want to keep close tabs on how they perform on various

internal benchmarks, and test kits that you have right as you're training these and

as you get closer to actually transformative AI, it becomes also more important from a safety point of  

view, like to see, so maybe you could see that, "Wow  this new architecture scales very differently," so  

that now every day that we train it, it improves by X amount. It looks super impressive, it's not plateauing; maybe we can then predict that probably, if we keep this going for another few weeks, it will sort of, you know, reach IQ 130, then 150, and now it's still going strong, okay, and there are no signs of slowing down. Then maybe you would know, okay, looks like

this actually could be it, and maybe that's when you would, if you had this time to burn, maybe  

decide to use some of that to do whatever final sprint you could on the safety front.

Yeah, I think many people here are probably interested  in your thoughts about, yeah, first of all,

timelines, like how quickly do you think we will actually reach superintelligence when we're  

talking about different future scenarios  here, but like, what is your prognosis? 

Well, I take short timelines seriously, including very  short timelines. And I think we are now, and have  

been for a couple of years, in a situation where  we can't exclude even very short timelines. 

It probably will take a bit longer, but we can't know that it couldn't happen in like 2 or 3 years.

I mean, in fact, we can't really be that sure it's not already... I mean, right now  

for all we know, right, in some lab maybe this guy, you know, working the night shift has figured out this big "unhobbling" thing that just, "Wow, this was the thing we were missing," and now the

same giant data centers that previously strained  to reach like Claude 4.5 level or ChatGPT Pro 5

like whatever, right, that like now with this new tweak, they just learn way... they get the  

same sample efficiency as humans have. So that  with their massive amount of data and our sample  

efficiency, they'd get like, you know... it  could happen. It's not very likely, but if it  

were happening right now, we would not necessarily  know of it. Now, I think so we need to  

start to like take into account the possibility  that there could be some surprise or it could  

happen within a just a small number of years.  I think probably it will take longer as I said,

but we can't be confident of that. Right, and by longer are we talking decades, or some more years? I mean, it's so impressive, the rate at which things have been improving.

If things were to take decades, like I guess one then thinks what could possibly be the  

reason for that? So one is obviously some kind of  external factor like some geopolitical disaster  

or this "Stop AI" movement gaining steam and sort of shutting things down, like that's

one type of way in which it could happen. And  another is that it could turn out that a lot of  

the gains we've had to date, the rapid progress  we've seen has been completely dependent on the  

rapid increase in computing power that we've had. And if you look at it, it appears that maybe, roughly speaking, half the progress has been due to algorithmic advances and half to increased hardware, but it might be that the algorithmic advances

themselves are kind of an indirect consequence  of hardware. You can run more experiments if you  

have better computers. There's more incentive  to actually work like for smart people to work  

on the algorithms like if it's an important  thing. So, but suppose it turned out that the  

real driver here was just hardware scaling and  that you need to sort of scale it up by an order  

of magnitude to get like a constant number  of IQ points as it were in capability, right?  

Then progress might soon slow down because we have now reached the levels of hardware investment  

where there is a limit to how much more it could grow. So if you're talking 'Stargate,' the data

center OpenAI is planning to build for $500 billion. Well, I mean, you could go a bit higher. I mean, maybe you could spend $5 trillion in theory, right, if you really thought that was the final push. But after that it gets really hard, right? It starts to become a very large chunk of the world economy; another order of magnitude beyond that is something like half the world's GDP.
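A quick back-of-envelope check of that order-of-magnitude claim (an editor's illustration; the $500 billion figure is from the conversation, and world GDP of roughly $100 trillion is an assumed round number):

```python
# Back-of-envelope: how far can capital scaling of AI hardware plausibly go?
stargate = 0.5e12       # ~$500 billion data-center build-out (figure from the talk)
world_gdp = 100e12      # ~$100 trillion, assumed round number for world GDP

one_order_up = 10 * stargate        # ~$5 trillion: "maybe you could spend in theory"
two_orders_up = 10 * one_order_up   # ~$50 trillion

print(one_order_up / world_gdp)     # 0.05 -> a few percent of world GDP
print(two_orders_up / world_gdp)    # 0.5  -> roughly half of world GDP
# So only about two more orders of magnitude of investment are even conceivable
# before the spend rivals a large fraction of the world economy.
```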

So at some point, to the extent that faster hardware available for AI research has been driven by increased capital investments, that will have to, it seems, start to slow down at least a little bit soon. And so at that point, if that was what was driving this

rapid progress. You could imagine if we haven't  reached superintelligence already by that point,  

then maybe timelines start to stretch out. Maybe what will then need to happen is that we

have to wait for some theoretical breakthrough  that makes it possible to do this way more  

efficiently, that could happen. Okay, and do  you think that this is likely or? I mean, I  

think less likely than the alternative. I  mean, it does look like we are sort of  

within striking distance it seems to me  with the continued scale-up and everything  

but you know we've never done this before. So  really we have to think in terms of probability  

distributions here. So, it sounds like you think it would be more likely than not that we have it within a decade? Certainly, I guess there are two ways of thinking about this. One is like the

inside view. If I just look at the algorithms,  the progress, the specific things, then I  

would say yes. Then there's this second way of  looking at things, which is like, you stand back  

you maybe have young children that go to school, you look at all the people who go around their ordinary business, and you think, do I really think that the world as we know it will end completely within less than 10 years, and that, you know, pension funds are just wasting their time, and that nobody should build a railway line because it's never going to be needed? So then there's, like, some, I guess, sanity penalty that comes from this kind of more common-sense perspective, and so my actual views are some sort of superposition between those. But, I don't know, this kind of common-sense prior influence is not, I think, strong enough to fully overcome the inside perspective on these things. All right, so is it

like, you think that it will happen soon, but you don't feel that it will happen soon, or you mean that you should really take into consideration this, like, sanity-check perspective as well? I don't think it should dominate your thinking. I think

you shouldn't completely lose touch with it.  So I think like from a practical point of view,

I would like take this very seriously and spend  most of my time maybe operating on the assumption  

that we are, exactly how many years is hard to  say, but like we are kind of approaching this  

critical juncture in human history. But then if  there are things that can sort of hedge your bet  

like would you not want your child to have a good  education because you just assume that computers  

will anyway do everything? Probably not; it seems like you should do some sensible things just in case we turn out to have been a bit nuts and crazy about this thing, as so many other people have been throughout history about so many other things. Yeah, right, so given at

least a high possibility of quite short timelines,  and also, that we haven't really solved the  

alignment problem, even though you said we've made  some progress, like what is your P(doom)?

What is the probability that it all goes... Yeah, I  haven't really expressed a particular P(doom)

because I think it might also turn out to depend quite a lot on your definition of doom.  

So I think you can sort of imagine one class of scenarios where things just completely, like, all the value is lost, kind of thing, clearly doom, and then another that's utopia, right, so not doom; but then I think there's a broad class of scenarios which are such that even if we could now see exactly what will happen, we might still be unsure whether to count that as doom or not. Like it might be that the future is really weird, and some of the things we now thought were really important and valuable

have disappeared. But other things have appeared  that might also have value and we might not really  

have some easy way to tote all of that up. Like, how much is it worth to get rid of factory farming, or to get rid of third-world poverty, or to get rid of cancer? And then maybe you lose

some of the various things that humans think  are very important, like I don't know the idea  

of being a breadwinner or having discrete minds  in separate crania as opposed to one big blob,  

or I don't know exactly what it would be, but quite likely the future will be very weird in

such a way that it might contain both elements  that some people might regard as negative and  

and other people might regard as positive. And so,  I mean in fact, I think that maybe, that middle  

possibility might be perhaps more likely than  either the clearly not doom or the clearly doom.  

Okay, so, but if we're speaking strictly about the extinction scenario that 

everyone dies, like do you think that there's a  high possibility of that? No, I mean, so then

there is another complication that comes in here,  which is that even if we have a misaligned AI that  

doesn't care about us, and that has the power to eradicate all humans if it wanted to.

There might be other reasons, instrumental  reasons, for that AI not to do so. It might,

for example, want to be cooperative with  other AIs that might exist in the universe  

that might care about either humanlike creatures or care about general

ethical principles or norms of cooperation. And because it would be very cheap for an AI to preserve humanity, you know, maybe even give us the whole planet or the whole solar system, right? There's a lot of space out there. It could get, like, 99.99999, with a lot of 9s, percent of

all the resources for its preferred use and still manage to keep humans around in like some  

kind of paradise-like environment for us, if it so wished. So it wouldn't require much, it seems, either of intrinsically caring for us a little bit or placing some instrumental value on having us around, for it to be that even in the case where you get a radically misaligned AI that ends up in a position of absolute power, you still don't get human extinction from that.

Okay, yeah, it's very interesting because  when I read "Superintelligence," like it  

feels as if your views on this might have changed  a little bit because then you talk about it  

really as, I mean, that it would really want to, as quickly as possible, start to colonize the universe, because if it waits, it will automatically lose many stars and galaxies just because it waited a little bit longer until it started sending out probes, because there are parts of the universe that it then can't reach anymore. Do you think that this will be outweighed by

this theoretical consideration? I mean, it might want to reach out quickly, but hopefully not killing us would not cause much of a delay in that regard. I mean, you can still have some

massive place in the desert where like huge  starships go up, or you do some fusion

like whatever, right? That doesn't require wiping  out all humans around the planet. Even if you  

sort of wanted to like invest massive resources  in just getting something out there that could  

then start a sort of self-replicating process, spreading through the galaxy and beyond. If you

want it to be even more economical, you might even  imagine uploading humans and kind of continuing us 

on a more efficient substrate, or there are various possibilities. But it does complicate the question of assigning a probability to human extinction, because it might be quite a different probability than the probability of misaligned AI taking over. And then whether you count that as doom or not might depend quite sensitively on your

value function. Like if you mostly care about  people, normal people with normal human values,  

having great human lives, living happily with  their friends and family, and doing the humanlike  

things with art and cinema and perfect medicine  and like all of these things. If that's most of

what you care about, then, in some of these scenarios, things might be like 99% as good as they could possibly be. If, on the other hand, you're like a utilitarian of the sort

that cares about the total amount of utility in  the universe, and you would want to transform  

all these galaxies into hedonium or something like that, like matter optimized for feeling pleasure, then the scenario might be basically as bad as total loss, because Earth is an insignificant crumb in this vast sea of resources. So if all the rest of the stuff were used to,

I don't know, make paper clips or whatever the AI happens to want, and that doesn't happen

to coincide with what this kind of aggregative  consequentialist value perspective would want,  

then it might count as basically rounding to zero. So you might get a radically different

opinion on this scenario, whether it counts  as doom or like an amazing success, depending on  

which value function you have where different  people might like choose these different value  

functions. Neither of these value functions is so crazy that nobody would have it. So that's an

example of what I was alluding to earlier that  there might be this big category in the middle 

where whether it counts as doom or not depends  quite sensitively on how we evaluate it in ways  

that might not be very clear or obvious to us now  even if we could see exactly how things would  

play out, which we can't. Okay, yes, I still find it interesting because it really feels

quite different from reading your book.  And I wonder if like this idea that it

would want to care for humanity because of some future interaction with other super AIs and cooperation. I mean, why would it specifically care for us and not other animals

and then see us as sort of parasites on  the Earth, or like, is this really something...

it seems like a very important question if you  really believe that this is likely  

to happen, that it will sort of leave humans in peace, or if it's more likely that it will do something else and just won't care about us? I mean, that seems to be very, very important to sort out. Yeah, I mean, so I didn't express, like, a likelihood on this, but one reason is that there might be other civilizations out there that manage to align their

AI for example. I would hope that if we align our  AI, it will at least care a little bit about  

other creatures out there on other planets, other ape-like creatures or octopuses that became sentient and developed a tech civilization. And if our superintelligence, eventually, in 500 million years, comes across some octopus civilization, I would like to think that we

would then want to be nice to them. And so  if there are at least some AIs like that around  

then they might engage in trades and stuff like that, that, you know, would promote these values. That doesn't seem that unlikely, and what exactly will happen in that space is very hard to predict, but at least it seems like a live possibility that, from our current sort of occluded perspective, we are in no position to dismiss. I would also say, by way of partial explanation, so you notice some kind of, I don't know, tonal shift or shift of emphasis or something like that compared to "Superintelligence," and it is true, and part of that is the

context. So when I was working on that book, um  the whole issue of AI safety was completely

ignored by basically everyone. Certainly nobody in  academia took it seriously, aside from Eliezer and  

like a few other people on the internet. The whole world just dismissed it as science fiction, and to the extent that people were interested in AI, the only focus was how can we actually get it, like, so we have academic departments trying to make progress in this and that. And so at that point, there seemed to be a clear deficit on the side of understanding what we were actually working towards and understanding that there could be risks, which seemed really important

because if you could understand that, then we  might also actually use the time available to  

prepare ourselves to avoid the pitfalls, right? If  you have some conception of what could go wrong,  

you might take action to prevent things from  going wrong. In the intervening years,  

there has been a big shift and now there is  much wider recognition of the idea of AI safety  

as being important, like it really has become  part of the mainstream conversation, and  

including among sort of serious people.  You hear world leaders talking about this,  

you hear the tech leaders of big companies, and as I said, the frontier AI labs now have

research groups working hard on this, and there  are a bunch of other organizations as well. So,  

so the situation on the ground has changed a lot.  So now, I think there is less need for 

me to keep harping on the same thing again when  that point is already quite widely recognized and  

so I'm focusing more on other insights that maybe haven't yet sort of percolated as widely, and trying to bring those to people's attention. Right, right, so your focus is now

more on other things. But still it would be  interesting to just get like sort of a feeling  

because I mean, you were obviously quite  worried about this when you wrote the book and  

some things have changed now. Like given  everything you know and the situation we're in. 

If you really had to put a number, or at least a range, on the probability of complete human extinction, would it be like a two-digit number, your P(doom)?

Well, yeah maybe, but I don't know. It also  might depend on like, sorry to be this kind of  

"Depends on what you mean, the definition, like depends  on what you mean by human?" But it might depend on  

what you mean by human like if there are like only  uploads left then no biological humans for example  

Does that count? I think maybe it  wouldn't count. I mean, or do you think  

that we or like that I would be myself if I was  uploaded? Under certain conditions that  

I would think so, and in fact plausible to me, that the best path involves uploading at some point 

you know. I would favor a world probably where  different people were free to choose their own  

trajectories. But ultimately, I don't see why we need to use meat to do the computation, when semiconductors might ultimately be much more efficient. But with all of these things, it would be nice... I think if we tried now just to make up our minds about a host of those

kinds of questions, we would kind of be bound to  get at least one of them wrong. And so what  

we would hope for, I think, is maybe to end up in  a situation where we're able to like think a lot  

harder about these things and deliberate, maybe  with AI advice, rather than having to sort  

of implement our current best conception of the  future we would want and then locking ourselves  

into that. I think that likely would sort of miss  out on a lot of really exciting possibilities.

So that's one, there are other factors as well. Well, I don't know how much time we have.

Like, I have one recent paper, not really a fully developed paper, called "AI Creation and the Cosmic Host," which introduces, quite handwavily, the idea

that if we give birth to superintelligence, it  will enter a world in which there quite possibly  

are other super beings already in existence.  So these could be other AIs built by some alien

civilization in some remote galaxy. Or, in the Everettian interpretation of quantum mechanics, there are many branches, and so there might be other branches of Earth-originating

life that have produced or will produce different  forms of superintelligence. If the simulation  

argument, right, is to be trusted, we may be in a simulation; the simulators then would presumably be superintelligent and be super beings. And of course there are traditional theological conceptions as well, right? God is a super being, usually superintelligent. And so in any of these cases there would be this kind of cosmic host consisting of these other super beings. And one important desideratum for us going forward here, I think, is that if we create

superintelligence, we would want to make it such  that it can get along with this cosmic host  

and maybe adhere to whatever norms might have  been developed within this cosmic host. And so  

that I think adds a dimension that has to some  extent been missing in the classical discourse  

around AI safety, where the attitude has very much been: how can we maximally control the AI so that it maximizes our own expected utility, just taking our own

preferences into account. There might be this much  larger picture where we are very small and very  

weak and very new, and there is this kind of incumbent set of super-powerful beings, and how our superintelligence interacts with that might be a very critical part of how well things go. So I think the upshot of that is a bit unclear. We don't know precisely what it means, but I think it slightly increases the chance that we ought to develop superintelligence, and I also think that we should approach it with a bit of an attitude of humility, that we don't know very much here. There are more things in heaven and earth than are dreamt of in our philosophy. And so I think there is some different mindset that maybe also comes from considering that type of perspective. Okay, yes, yeah. Obviously a lot

of things to consider here. Well, I'll try not to stretch out our time too much, because I know you're sitting up late now. But can I just ask, because I feel that I still don't have a clear idea of where you are on this extinction scale, and this is something that worries me and lots of people: would you say that your view now is that you still acknowledge there is a risk that everyone just dies, but you don't think it's a major risk anymore?

No, I mean, it's a pretty serious outcome if it happens, right? So even if the risk were quite small, it would be worth taking seriously. But it is worth bearing in mind, if we are concerned about our own personal deaths, that death is something that is likely to happen anyway. And then you might say, well, it matters when it happens, right? It's not just whether you die, but that it not happen too soon. So then you might think in terms of life expectancy. But if you actually start doing the math on that, very plausibly our life expectancy goes up dramatically if we develop superintelligence, even if misalignment risk is quite high, because if AI is developed and it's successful, if it's safe, then it could do a lot to advance medicine, you know, invent rejuvenation therapies, anti-aging medicine, and so forth. And so our lifespan, conditional on things going well, would be very long. That goes back to this paper that I'm working on; I hope to have more to say on that in the relatively near future. But the upshot is that even if one is concerned mostly with personal survival, it is not at all clear that this should make us anti-AI or make us want to slow it down by a huge amount. All right, but I suppose that you can make that argument even

suppose that you can make that argument even  if the risk was like 99% that everyone would die?

And I mean, most people would probably not  like to take that bet. Well, if it's that big,  

then the argument might not work. I mean, then  it becomes more complicated. So, you might think  

of esoteric ways in which we might achieve very  long lifetimes even if we don't develop AI. But  

also, you might not value future life expectancy  linearly. You might have a kind of diminishing  

returns or time discounting. And then, especially if you thought that the risk was going down over time, if the risk now was 99% but it was falling by like 5% a year or something, you would prefer to wait. But at some point, once the risk is low enough or once the rate of further decline is low enough, depending on various parameters, you would favor taking the plunge. But there are other things, of course; aside from our own personal survival, we might

also care about other things, like the survival of the human species, or Earth, or human values in the broader sense, among other things. So it's very hard to form an all-things-considered view about these big-picture questions. I call it macrostrategy: trying to figure out how all of these different pieces fit together. And it's quite plausible that we have overlooked at least one crucial consideration, some fact or argument or idea such that, if only we discovered it or fully took it into account, it would quite profoundly change our opinion, not just about the details but about the overall direction we should be trying to go in. Should we want more global coordination or less? Faster AI progress or slower? Do we want more synthetic biology or less, more surveillance or less? For these big macrostrategic parameters, it can be very hard to form any kind of confident opinion about what the answer to those kinds of questions is.
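The waiting-versus-plunging tradeoff sketched above can be made concrete with a small toy calculation. The sketch below is not from the talk; every parameter is an illustrative assumption (doom risk starting at 99% and falling five percentage points a year, a 3% annual discount rate, 2% ordinary mortality while waiting, and a 10,000-year lifespan if deployment goes well).

```python
# Illustrative toy model only; none of these numbers come from the talk.
def expected_discounted_life_years(wait, p_doom_now=0.99, annual_drop=0.05,
                                   ordinary_mortality=0.02, discount=0.03,
                                   lifespan_if_safe=10_000):
    """Expected discounted life-years if superintelligence is deployed after
    `wait` years, with the doom probability falling by `annual_drop` per year
    and ordinary mortality and time discounting applied while we wait."""
    d = 1 / (1 + discount)             # per-year discount factor
    s = d * (1 - ordinary_mortality)   # discount times survival while waiting
    p = max(p_doom_now - annual_drop * wait, 0.0)

    # Discounted, survival-weighted years lived during the waiting period.
    years_waiting = sum(s ** t for t in range(1, wait + 1))

    # If we survive the wait and deployment goes well, we get a very long
    # (rejuvenated) lifespan; if it goes badly, zero further life-years.
    post_deploy = sum(d ** t for t in range(1, lifespan_if_safe + 1))
    return years_waiting + (s ** wait) * (1 - p) * post_deploy

if __name__ == "__main__":
    for wait in (0, 5, 10, 15, 20, 25, 30):
        print(f"deploy after {wait:2d} years: "
              f"{expected_discounted_life_years(wait):5.1f} expected life-years")
```

Under these assumed numbers the expected value peaks at roughly a 20-year wait, which matches the qualitative shape of the argument: wait while the risk is still falling quickly, then take the plunge once further decline no longer compensates for discounting and ordinary mortality in the meantime.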

Yeah, right. Thank you for all of these very interesting thoughts. I realize that there are many things to think about here. Would you have time for one question from the audience? Well, let's do one then. Okay, if we have one, someone who's eager to ask anything. Yeah, we have one there. You can say it, and I can repeat it, because he hears through this mic. Do you think that we'll

take advantage of our own superintelligence that we create to fend off threats from other superintelligences that could reach us? So if we use Neuralink to sort of merge with superintelligence, could we then use that? Yeah? So this would be like an evolutionary change. Is there a possibility that we'll merge with AIs somehow? Yeah, my guess is that this

would happen after we have radical superintelligence, which could then perfect the technologies that would allow really good brain-computer interfaces, or uploading, or other things of that sort. My guess is that AI is moving quite fast, and there are all kinds of complications with actually implanting things in the brain that sometimes get elided in media headlines: you get a risk of infection, the implant moves around a little bit, and so on. So I think it's great for people with disabilities who can now, you know, walk or see thanks to such an implant, but it's quite difficult to do something that is better than

a normal human. We can already interact with computers through our eyeballs at something like 100 million bits per second, directly into the visual cortex, a large part of the brain customized specifically for processing this information. Output is a little bit more limited, but still, I'm usually more limited by my ability to think than by my ability to speak and type, right? It's not as if I could be a hundred times more productive if I could type a hundred times faster. So that would

be my main guess. Now it is possible that if you  have high bandwidth interfaces, the brain could  

learn to somehow leverage external computational  substrates in ways that would unlock some kind  

of synergy. Like, if there were some large pool of external working memory that the brain could somehow learn to access, if it were in high-bandwidth communication with it over a long period of time, you can't completely exclude that. Or if you had many humans who could telepathically communicate: maybe initially it would just be telepathic communication, which would not really be that different from speaking, but maybe if they were connected like that over a long period of time, they could eventually learn to have thoughts, single complex thoughts, that extended over multiple brains without the whole thought being located in any one of them. It's possible that these more interesting things could happen, but if your timelines for AI are short, like some single-digit number of years, it's just hard to see how they would happen in a way that really meaningfully moves the needle within that time scale. That would be my guess.

Okay, yeah. Thank you very much. Let's  hope then that the superintelligence will be  

kind to us. And it's been so great to talk  to you. I'm really sorry for sort of pushing  

a little bit on time. It's just so extremely  interesting to hear your thoughts on this, and  

I think we should give a big round of applause  for Nick Bostrom. Well, well, thank you, everybody,  

and it was fun. Thank you, Jonas. Thank you.  And yeah, have a good night now, and yeah,  

good luck with all the research. I look forward  to reading your papers also in the future.
