
AWS should not be broken up

By Theo - t3.gg

Summary

Key Takeaways

  • US East 1 outage impacted major services: A widespread outage in AWS's US East 1 region took down numerous services, including Snapchat, Netflix, and DoorDash, highlighting how much of the web relies on this single region. [00:04]
  • DNS issue caused the initial AWS failure: The outage was triggered by a DNS resolution issue with the regional DynamoDB service endpoints, impacting services within the US East 1 region. [03:02]
  • Breaking up AWS is a bad idea: Doing so would force every company to reinvent complex infrastructure, slowing down innovation and the entire tech ecosystem. [01:13], [16:26]
  • Cloud infrastructure levels the playing field: AWS provides infrastructure that lets small developers and large companies like Netflix operate on the same level, fostering competition and innovation. [15:42], [19:39]
  • Outages make the internet more resilient: Major outages like the one in US East 1 incentivize companies like AWS to invest heavily in resilience, ultimately making the internet more robust. [12:29]
  • VPS hosting is not a viable alternative: Relying on cheap VPS hosting is often riskier than using major cloud providers, as it lacks the reliability, scalability, and redundancy real-world applications need. [07:30], [11:09]

Topics Covered

  • Why VPS hosting is a delusion for real software.
  • Does an AWS outage make the internet more resilient?
  • Abstractions accelerate innovation, they aren't 'just wrappers.'
  • Cloud democratizes infrastructure, not monopolizes it.
  • Capitalism made enterprise infrastructure accessible to all.

Full Transcript

You might have noticed that pretty much

the entire internet was down yesterday.

Yeah, like the whole thing. I couldn't

even order food for my cat here. It was

nuts. Everything from Snapchat to

Netflix to DoorDash to McDonald's was

entirely down. How does everything go

down at once? Well, AWS. So much of the

web is powered by AWS. And so many

services and apps we rely on every day

are all based out of US East 1 from

Amazon. So if that goes down, everything

goes down.

even if my cat wants to be in my face.

Be thankful this guy's not your IT

admin. There's a good chance you've seen

this one covered in many different

places by now. There are so many sources

and other great creators and YouTubers

that have broken down the depth of what

happened here and why it happened.

Everything from DNS to DynamoDB to

multi-region failures, etc. But I want to

talk about this for a very different

reason. I want to talk about this

because I saw this post from Elizabeth

Warren, and I'm from Elizabeth Warren's

state, Massachusetts. Normally I try to

not talk about politics here and I'm

going to do my best to not make this

political, but I want to explain why the

current state of the cloud is actually a

really good thing. Why it's awesome that

we have services like AWS, GCP, and

Azure to rely on. And why the idea of

breaking these companies up after

today's outage is a really, really,

really bad idea. All of that said,

someone's got to pay for this cat's

food. So, we're going to take a quick

break for today's sponsor and then we'll

dive right in. Your engineers are really

fast. Your GitHub CI probably isn't. At

least unless you're using today's

sponsor, Blacksmith. These guys really

get GitHub CI and Actions better than

almost anyone I've talked to in my life.

I just got back from one of their events

and I had such a good time chatting with

these guys. The thing I didn't realize

until recently is that they're not just

the best, cheapest alternative to GitHub

Actions. That's literally one line of

code. Like, that's enough of a reason to

use it, right? But where it gets really

fun is the observability stuff. As you

guys know, as I complain about a lot,

GitHub doesn't really improve their

platform. The fact that GitHub Actions

still work the exact same way they have

for like almost 10 years now is

insulting. There's no way to see how

high your failure rates are, where

things are failing in the pipeline, and

just to get like an overview of your

actual work. You just scroll through a

list of failed or passed jobs. Useless. I

mean, actual observability into your CI

is unbelievable. And the more I've heard

people moving over to Blacksmith, the

more I've been hearing them say this is

their favorite part. It's so much easier

to figure out why things are failing,

sort through your logs, fix failing

tests, and so much more. If you're like

me and you've been shipping way more

code lately, Blacksmith's going to make

your life significantly easier. It's

free to get started. It's way faster

than GitHub, and it's cheaper overall,

too. It's really hard to go wrong with

Blacksmith. And if you don't believe me,

check them out now at

soyv.link/blacksmith.

Here are the three key things I want to

focus on for this video. A real quick

overview of why it happened. Then we're

going to go into why you should still

use AWS despite this. And then we'll

wrap up with my response to that

Elizabeth Warren post, why I think it's

stupid to break up AWS. So let's start

with why it happened. There's a lot of

varying coverage. I think that this

article from The Register is one of the

better options. The TL;DR is that there

was a DNS issue through Dynamo that

caused this first layer of failures. US

East has multiple redundant data

centers, but there's an abstraction on

top to route the traffic to those data

centers. In that abstraction layer, that

DNS zone failed. According to Amazon

themselves, AWS experienced increased

error rates on AWS services in US East 1,

which impacted Amazon.com and Amazon

subsidiaries as well as AWS support

operations. This is between 11:49 p.m.

on the 19th and 2:24 a.m. on the 20th. I

was out partying at TwitchCon, which is

really funny because my bed would have

woken me up otherwise. I was sleeping.

It was 2 a.m.-ish and I was alerted about

the issue because my internet mattress,

which I love (shout out), was warm, which

led me to opening the app, and the

Eight Sleep app was having issues because

of the AWS outage, so I couldn't

change the temperature of my bed.

Totally not a problem I've had before.

Definitely not a thing I've experienced

in the past. By the way, I do have an

affiliate link if you want to sleep

better than you've ever slept. Soyv.leep

link in the description. Sounds insane,

but having a water-cooled bed that changes

temperature throughout the night is

actually really nice. I bought it to

make fun of it and now it's like my

favorite thing in the world. I missed it

dearly during my trip. So that was when

the outage happened late at night. The

cause of this event was a DNS resolution

issue for the regional DynamoDB service

endpoints. It was mitigated by 2:24 a.m.
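The failure mode is worth making concrete: a regional service endpoint is one hostname that many unrelated products depend on, so a single failed lookup surfaces as errors everywhere downstream. Here's a minimal sketch with a simulated resolver and hypothetical service names, not real AWS behavior:

```python
# Simulated resolver: one failed DNS zone, many dependent services.
FAILED_ZONES = {"dynamodb.us-east-1.amazonaws.com"}  # the endpoint that stopped resolving

def resolve(hostname: str) -> str:
    """Stand-in for a real DNS lookup; raises the way a failed lookup would."""
    if hostname in FAILED_ZONES:
        raise OSError(f"DNS resolution failed for {hostname}")
    return "203.0.113.10"  # placeholder IP from the documentation range

def call_service(name: str, endpoint: str) -> str:
    """Every app that depends on the endpoint sees errors, not just one."""
    try:
        resolve(endpoint)
        return f"{name}: ok"
    except OSError:
        return f"{name}: increased error rate"

# Hypothetical dependents sharing the same regional endpoint:
apps = {
    "snapchat-backend": "dynamodb.us-east-1.amazonaws.com",
    "doordash-orders": "dynamodb.us-east-1.amazonaws.com",
    "unrelated-eu-app": "dynamodb.eu-west-1.amazonaws.com",
}
for app, endpoint in apps.items():
    print(call_service(app, endpoint))
```

The point of the sketch: nothing inside the services themselves was broken; the shared name in front of them was, which is why so many unrelated apps failed at once while other regions stayed fine.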

After this was resolved, AWS services

began recovering, but a small subset of

internal subsystems continued to be

impaired. They are still investigating

what caused this second layer of things

that meant the outage was continuing. It

took until 3:01 p.m. for all things to

be fully restored. There are a

number of layers to why that second set of

failures happened. It's probably a

combination of all the work that was put

off due to the fact that the outage was

occurring, all the Lambdas that were

queued, all of the work in SQS, all the

different things people do on AWS. It

probably thundering-herded itself

to death over and over again to some

extent. But we'll wait till we have a

more thorough update from Amazon to be

confident on that. There are also claims

that this is a result of the brain drain

as more and more quality people from

Amazon and AWS have been leaving and now

the best engineers aren't there and

who's left might not remember how DNS

works. As we know, it's always DNS.
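The retry pile-up described above, where queued work slams a recovering service all at once, is the classic thundering-herd problem, and the standard mitigation is exponential backoff with jitter. A minimal sketch (the base and cap values here are made up for illustration):

```python
import random

def backoff_delays(attempts, base=0.1, cap=20.0, seed=None):
    """Full-jitter exponential backoff: before retry i, wait a random
    amount between 0 and min(cap, base * 2**i), so recovering clients
    spread their retries out instead of stampeding all at once."""
    rng = random.Random(seed)
    return [rng.uniform(0, min(cap, base * 2**i)) for i in range(attempts)]

# Each retry draws its wait from an exponentially growing window:
for i, delay in enumerate(backoff_delays(6, seed=1)):
    window = min(20.0, 0.1 * 2**i)
    print(f"retry {i}: window 0-{window:.1f}s, chose {delay:.3f}s")
```

The jitter matters as much as the exponent: without the randomness, every client that failed at the same instant retries at the same instant too, and the herd re-forms on each cycle.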

Yeah, that's the TL;DR of why it

happened. Engineers everywhere

pretending to monitor the situation

while refreshing the AWS status page.

This was so real. This was us for T3

chat, which yes, was down. It's funny

cuz we were not in US East until I moved

to Convex. So technically, this is once

again Convex causing me problems. As

silly as that is, Convex is not the

problem at all. They made the right

choice here. Also, an HNer trying to get

dunks and somebody who I have blocked as

the starting point. Very annoying cuz

like yes, Vercel was down like

everything else is there. Yeah, I guess

this is a good transition into step two.

First off, thank you Sam for replying to

this. Well, I love Sam. If you don't

know him, he makes incredible articles

about all sorts of different stuff. He

did the load balancing article that we

did a video about. He did the retries

one and the queuing one. So many great

articles. I highly recommend checking

his stuff out. This is a good post.

You're not down today, but one day you

will be. You work under the same

constraints everyone else does. You know

how stressful it is when it happens to

you. Be better. Yes, betting on someone

else doesn't make the problem go away.

It's yet another bet. Every additional

layer of bets does make things riskier.

So the fact that for me there is AWS as

like layer one. So we have AWS

US East 1, and then for my database I have

Convex as a layer. But even Convex is an

abstraction on top of PlanetScale

Metal. To be truthful,

Metal is very separated from AWS. It is

in the same regions, but they are

renting shelf space and like racks that

they put their own hardware into. So,

they're as much to the side here as they

are like another layer. I know a lot of

people who are on PlanetScale had no

issues despite the fact that AWS was

[ __ ] itself. So, this is my stack.

So, now if any piece here fails, T3 chat

on top (a tiny little box) is screwed.
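One way to reason about those layered bets: if every layer in the chain must be up for the app to work, availabilities multiply. A toy calculation with invented numbers (nobody's real SLA):

```python
# Toy numbers, purely illustrative (not real SLAs): when an app needs
# every layer in its chain to be up, availabilities multiply.
layers = {
    "cloud-region": 0.999,    # e.g. a single AWS region
    "database-layer": 0.999,  # e.g. a managed DB running on top of it
    "app-platform": 0.999,    # e.g. the deployment platform
}

combined = 1.0
for availability in layers.values():
    combined *= availability

print(f"stack availability: {combined:.4%}")
# Three 99.9% layers in series come out to roughly 99.70%: every extra
# serial dependency eats into the uptime budget, which is why a layer
# only earns its place if it adds more reliability than it costs.
```

This is also why a layer like PlanetScale Metal can be a net positive: it sits beside the cloud provider on its own hardware rather than stacking strictly on top, so it doesn't just multiply in as another serial risk.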

And when you add more layers, you do

potentially increase the surface area

for problems to happen. But there are

certain ones like Planet Scale Metal

that tend to increase reliability rather

than decrease it. But that leads us to a

question. Does using the same servers as

Amazon increase or decrease risk? The

fact that people are currently

unironically saying that it is riskier

to use Amazon servers than it is to use

some random company hosting one-off

servers for way too cheap in Germany,

where one power surge could cost you all

your data, is pretty absurd to me. It's

just, no. It makes no sense. If this was

the obvious correct path, you would have

companies like Netflix doing it. Why

would Netflix, a multi-billion dollar

company, do this wrong? If you genuinely

believe it is easier and safer to host

VPSes, why is it only random hobbyists

and people who have a service with 100

users talking about it ever? I've never

seen real production apps talking about

this outside of a little bit of stuff

that's slowly dying from the DHH camp.

Like, I hate to be so direct about this.

It's just kind of silly that none of the

people building real software are

building this way. I've never seen

someone proudly bragging that all of

their stuff is on Hetzner or is on some

random modern VPS solution that doesn't

have endless problems or zero users.

Like it's just always the case. Does

this mean you should never learn how to

do that? No, absolutely not. Everyone

should know how to spin up a Linux

server and run code on it if you're

working on servers for a living. But you

should have a general idea of how the

pieces work so that you know what you're

not dealing with when you move to

something like AWS. And no, this isn't

because corporations prefer AWS. That is

[ __ ] delusional. There are companies

that are competing with Amazon that are

still building on top of AWS. They're

doing it because it's the right balance

of capability, price, reliability, and

overall functionality. Like, it's just

duh. I just cannot take anyone

seriously who is sitting here saying

that AWS, GCP, and Azure are bad bets

because you could just use a server.

None of these people are building real

software. I'm sorry. Sure, you can sit

here and [ __ ] on me all you want and

say, "Well, Theo, you could put T3 chat

on a VPS, a really big one." But yeah,

maybe. And the moment we have slightly

too many users, we're screwed or we have

to start putting a Kubernetes layer in

front to distribute it. Fun.

By the way, Mammud here made a great

post questioning why someone would

deploy their app to a VPS in 2025 as an

engineer at Railway, which is one of the

few companies that is actually hosting

their own bare metal. They've been moving

everything over to their own servers

recently. They are kind of like modern

Heroku. Great company. I have a lot of

friends there. What's funny with Railway

is that they're the company that should

be saying this because to an extent

they're letting you host VPSes, but they

have, as many have, built a much better

abstraction on top because doing the

exact same management and configuration

everyone else has to do, makes no

[ __ ] sense. None. None at all. And

even a company like Cloudflare, which is

basically essential if you have a CDN

and DDoS protection, which by the way, if

you're hosting on a VPS, you [ __ ]

need DDoS protection. There is no world

in which you're hosting on a VPS and

dealing with insane loads without

something in front to protect it. And if

you have Cloudflare as the layer in

front preventing DDoS attacks, there's a

very good chance that layer is going to

go down at some point because it has in

the past due to the fact that they're

using GCP still for the storage for

things like KV. Yes, really. There are

parts of Cloudflare that are using GCP

still. They're working on their way off

it as far as I know, but it's still

there. So, everyone is vulnerable. If

your service is not vulnerable to

outages like this, not saying you're in

US East 1, but if your service is not

vulnerable to something like this, your

service isn't real or you're just wrong.

Period. Point blank. End of story.

Everyone is vulnerable to something in

their chain failing or they don't have a

real [ __ ] chain. This is not a time

for us to be like, "Well, maybe we

should move to servers." No, that's just

[ __ ] delusion. It's cope. It's people

who don't know what they're talking

about. It's a bunch of web devs hosting

[ __ ] in PHP that pretend they know how

services work. I'm sorry. just not the

real world. That all said, we should

probably stop doing everything in one

region. Sure, AWS does have multiple

data centers within their region, and if

any one fails or has an outage, the

other two are probably fine. But the

problem here isn't that a service

within AWS failed or that a data

center went down. It's that the layer that

routes to that data center did. If

anything, I would argue that it is

incredibly impressive that there is an

outage that can occur at the DNS level

that only hits and affects one region

with AWS because this region has

multiple data centers. If one data

center fails, things happen. If they all

fail, worse things happen. But if you

have DNS routing between those things

and that fails, but it only fails in

US East 1, that's almost a [ __ ]

miracle. The fact that only one piece of

AWS went down for this is genuinely

impressive. And generally speaking, the

fact that AWS outages like this are so

expensive is more incentive for them to

keep it from happening in the future. So

many companies hide from the fact that

they have outages like this. Amazon

can't. Amazon lost billions upon billions

of dollars because of this outage. Do

you seriously think they're going to

just let it happen again and that they're

not going to put a shitload of guards in

place to prevent this from happening

again? This outage made the internet

more resilient than it's ever been. And

that's going to continue to be the case.

AWS is going to invest more heavily than

they ever have in resilience for these

things because otherwise people will

actually move and also their own

businesses will lose money because

Amazon was down for a lot of this as

well. It's just it's genuinely silly to

me to think that this is a reason to

move off of AWS. Like what? Cathode Ray

Dude is one of my favorite YouTubers. He

did a great video on Teos recently and

how the company's slowly dying. And NMG

just dropped a quote from him here that

I think fits here perfectly. How dare

they ask me to pay money for a

well-designed, purpose-built device that

does a great job at a specific task that

I value highly. What fiends? I'll show

them by spending the same amount of

money and vastly more time and effort

making something that works almost half

as well. Yeah, this is the problem. This

is how I feel whenever somebody talks

about things that have to do with not

using AWS. Like, oh, cool. So, you're

going to go reinvent everything

yourself. Let's see how that goes for

you.

Let me know when you can actually work

on your software again. I have seen

people sharing this and a bunch of memes

in a similar format claiming that

companies like Vercel are in the end

just AWS wrappers. Even Netflix is just

an AWS wrapper. Do you know what C++ is?

I'll give you a hint: it has to be compiled

to assembly. Everything's a [ __ ]

wrapper. I am so tired of this goddamn

argument. The entirety of technology is

wrappers on top of other things. Welcome

to software. There's almost nothing that

is that bottom level. Literally every

single thing we work in and touch that

is worth using is an abstraction.

Companies selling shirts aren't creating

their own fabric. Engineers writing C++

are not writing their own assembly.

Services giving you access to servers

are not going to host their own servers.

Unless, and this is the key, unless

there's a really good specific reason to

do it. When a company is making

infrastructure like Vercel and they

choose to work with AWS, what that tells

me isn't, oh, they're going to

overcharge me for an AWS wrapper. What it

tells me is that the problem space they

choose to solve in is different from the

problem space that AWS is solving in.

And rather than try to reinvent all of

the hard work and billions of dollars that

AWS has spent to make incredible

scalable servers for a reasonable price,

they decided to take advantage of that

existing work and build on top of it.

This actually makes for a really good

transition to point three. Why breaking

up Amazon is stupid. This is the post

that inspired my whole rant. According

to Elizabeth Warren, if a company can

break the entire internet, they are too

big. Period. It's time to break up big

tech. As the community note correctly

states here, AWS is not a monopoly. It

represents 30% of the web. The fact that

30% of the web feels so big is pretty

wild, but it does feel like so much was

down. Is it bad that one company going

down can affect other things? Perhaps.

But that's how supply chains work. When

the Suez Canal was blocked, the entire

world was put on hold. Is it okay for a

company to own something as valuable

as the Suez Canal? If there were only

one, probably not. But there isn't.

There are so many options. There are

arguably too many options for hosting. I

would argue the opposite of this point

specifically. Personally, I think it's

pretty [ __ ] cool that some random

vibe coding kid has access to the exact

same infrastructure as Netflix. That is

the magic of what's happened here

because companies like AWS have to put

so much work in to build the products

that they're building. To build AWS is

such a complex thing to do. They decided

early on, through an Amazon-wide mandate

from Jeff, that anything that was

being used by multiple teams should be

built in a way that it could be

abstracted and sold to external

customers because there was so much work

being redone internally. They realized

at Amazon, oh, we have four teams that

are trying to find a way to store files.

What if we made a generic solution for

storing files and let all of the teams

use it? That innovation they made was

awesome. And if that innovation is a

thing they were forced to keep

internally because they're scared of

being broken up if they sell it, then

the entire ecosystem is going to slow

down because now every single [ __ ]

company has to reinvent file uploading.

Please explain to me in detail why that

would be a good idea. Why we need to

reinvent file uploading again and again.

Trust me, this is not a fight you want

to take with me in particular. Anyways,

it's pretty cool that anybody can use

the exact same infra as the biggest

companies in the [ __ ] world. That I

have the same level of reliability,

scalability, and service

functionality as companies like Amazon

do. Because instead of them taking all

of that work they put into every single

service and refusing to let anyone

benefit from it, they're doing the

opposite. They're buying companies like

Twitch, my old employer, taking their

video infra that was exclusively usable

by Twitch, turning it into a service,

and selling it to Twitch's competition,

like Kick. That's awesome. I think that

is great. This is one of the few times

that you can point at and with almost no

argument against it, say capitalism is

working well. Companies had incentives

to build these things for themselves and

then due to the nature of markets, they

had incentive to sell those things to

others. That's a good thing. The fact

that we all don't have to go reinvent

the concept of storing files is a really

positive win for the entire ecosystem.

It's part of what makes the whole tech

world as awesome and productive as it

is. We can build on top of the hard work

of others. If every single service, if

every single app, if every single

vibe-coded thing on Replit had to be

built using a bunch of things that they

rolled themselves instead of building on

top of these layers, we would all still

be writing [ __ ] assembly. This is all

on top of the fact that it's not a

goddamn monopoly. There are multiple

options. There's all the VPS bros

shedding themselves, but there's also

Google Cloud, which I have my issues

with. There's a reason people don't like it

as much. Azure, which is making almost

as much money as AWS now because they

charge insane licensing fees. We're also

using Azure for a bunch of our inference

right now for T3 chat. There's also

smaller companies I've been showcasing

like Railway that I think are really

cool, too. There's a surprising number

of businesses relying on one specific

region in AWS because it's useful. It's

a good thing to rely on and the

reliability tends to be pretty solid.

I'm going to go a bit of a different

direction here and talk about this, my

new iPhone. This is the iPhone 17 Pro

Max. It's expensive. I'm not going to

say otherwise, but the fact that pretty

much every iPhone user is on the same

effective device with like a plus or

minus 10 to 20% performance gap is

actually a really cool thing. The fact

that a billionaire doesn't have a way

more expensive version of a phone than

somebody who's making 50k a year is

good. Look at almost every other market

outside of tech. Look at cars. Look at

houses. Look at flights. Look at

everything else in the world. The

version somebody with a median wage uses

is fundamentally different from the

version that a big billionaire class

person uses.

I think it's nice that that's not always

the case and that you can't spend way

more money to get a way better iPhone.

The best iPhone in the world is like

$1,600. You cannot get better than that.

And the only difference is that

there's more storage. Like, this is cool.

I really like the fact that something

like AWS allows for you, me, and

billionaires to have the same exact

goddamn infrastructure. That's awesome.

This is capitalism winning. It's driven

the costs down so much and the quality

of product so high that there's no

reason to get something bigger and

fancier. You can get abstractions around

the best thing right now. You can get

things that make the DX better with AWS.

Things like Versell, they're not paying

me. I just like using them. But we're

all building on top of the same thing

because that one thing has

effectively become a commodity. The same

way that you and I drink the same water

and you and I own the same iPhone, you

and I use the same servers on AWS or on

Google or on Azure even, that's a

[ __ ] win. And the fact that you can

buy a $600 iPhone right now and have the

same quality of experience minus, I

don't know, maybe a slightly worse

camera than on my $1,400 to $1,800 one.

Cool. That's good. If you need more, get

more. Who cares? This is this is a good

thing. And it is genuinely annoying to

me that when any little thing goes

slightly wrong, the response is, "Never

should anyone use this again. We should

be breaking this up. It should be

illegal. It's terrible. It's bad. It's

awful." Not saying AWS doesn't have

problems. I'm certainly not saying Apple

doesn't. I could rant about the state of

the App Store for hours. In fact, I have

in the past. Check out my other videos.

What I'm saying is that it's pretty cool

that the level of entry is the same for

everyone and that we have successfully

made good enough primitives to work at

almost every different scale and that

people who are independents working on

small projects are benefiting from the

same new infrastructure and powerful

things that are being built by AWS as

companies like Netflix are. That's a

good thing. Imagine a world where

Netflix can spend billions of dollars

building infrastructure primitives and

then keeps it to themselves and no one

else is allowed to use those and you

have to build it yourself to compete at

all. The reason that there are these

small projects that are competing with

big companies now, the reason a 100

person team can be a real threat to a

100,000 person company is because AWS

has given access to everyone. All of

this said, the fact that Amazon uses

their profits from AWS to discount

things on Amazon.com to squash

competition, that might be worth talking

about a bit. Conversation for another

day. The death of diapers.com through

Amazon subsidies is a thing that is

actually monopolistic and is worth

talking about, but AWS being a useful

service has nothing to do with that. So,

Warren, please,

for the love of all things American,

don't say this or at the very least talk

to somebody technical that isn't a VPS

bro before saying something stupid like

this in the future. This makes all of us

look bad. In the words of Gergely here,

AWS didn't break the entire internet.

Companies that decided to build

non-resilient systems depending on one

single cloud region, US East 1, broke

themselves. Notice how X, Google, Meta,

Shopify, etc., were all fine. Breaking

up AWS would solve nothing. That's all I

got on this one. Let me know what you

all think.
