AWS should not be broken up
By Theo - t3.gg
Summary
## Key takeaways

- **US East 1 Outage Impacted Major Services**: A widespread outage in AWS's US East 1 region took down numerous services like Snapchat, Netflix, and DoorDash, highlighting the web's reliance on this single region. [00:04]
- **DNS Issue Caused Initial AWS Failure**: The AWS outage was initially triggered by a DNS resolution issue with the regional DynamoDB service endpoints, impacting services within the US East 1 region. [03:02]
- **Breaking Up AWS is a Bad Idea**: Breaking up AWS would force every company to reinvent complex infrastructure, slowing down innovation and the entire tech ecosystem. [01:13], [16:26]
- **Cloud Infrastructure Levels the Playing Field**: AWS provides infrastructure that lets small developers and large companies like Netflix operate on the same level, fostering competition and innovation. [15:42], [19:39]
- **Outages Make the Internet More Resilient**: Major outages like the one in US East 1 incentivize companies like AWS to invest heavily in resilience, ultimately making the internet more robust. [12:29]
- **VPS Hosting is Not a Viable Alternative**: Relying on cheap VPS hosting is often riskier than using major cloud providers, as it lacks the reliability, scalability, and redundancy necessary for real-world applications. [07:30], [11:09]
Topics Covered
- Why VPS hosting is a delusion for real software.
- Does an AWS outage make the internet more resilient?
- Abstractions accelerate innovation, they aren't 'just wrappers.'
- Cloud democratizes infrastructure, not monopolizes it.
- Capitalism made enterprise infrastructure accessible to all.
Full Transcript
You might have noticed that pretty much
the entire internet was down yesterday.
Yeah, like the whole thing. I couldn't
even order food for my cat here. It was
nuts. Everything from Snapchat to
Netflix to DoorDash to McDonald's was
entirely down. How does everything go
down at once? Well, AWS. So much of the
web is powered by AWS. And so many
services and apps we rely on every day
are all based out of US East one from
Amazon. So if that goes down, everything
goes down.
even if my cat wants to be in my face.
Be thankful this guy's not your IT
admin. There's a good chance you've seen
this one covered in many different
places by now. There are so many sources
and other great creators and YouTubers
that have broken down the depth of what
happened here and why it happened.
Everything from DNS to DynamoDB to
multi-region failures, etc. But I want to
talk about this for a very different
reason. I want to talk about this
because I saw this post from Elizabeth
Warren, and I'm from Elizabeth Warren's
state, Massachusetts. Normally I try to
not talk about politics here and I'm
going to do my best to not make this
political, but I want to explain why the
current state of the cloud is actually a
really good thing. Why it's awesome that
we have services like AWS, GCP, and
Azure to rely on. And why the idea of
breaking these companies up after
today's outage is a really, really,
really bad idea. All of that said,
someone's got to pay for this cat's
food. So, we're going to take a quick
break for today's sponsor and then we'll
dive right in. Your engineers are really
fast. Your GitHub CI probably isn't. At
least unless you're using today's
sponsor, Blacksmith. These guys really
get GitHub CI and Actions better than
almost anyone I've talked to in my life.
I just got back from one of their events
and I had such a good time chatting with
these guys. The thing I didn't realize
until recently is that they're not just
the best, cheapest alternative to GitHub
actions. That's literally one line of
code. Like, that's enough of a reason to
use it, right? But where it gets really
fun is the observability stuff. As you
guys know, as I complain about a lot,
GitHub doesn't really improve their
platform. The fact that GitHub Actions
still work the exact same way they have
for like almost 10 years now is
insulting. There's no way to see how
high your failure rates are, where
things are failing in the pipeline, and
just to get like an overview of your
actual work. You just scroll through a
list of failed or past jobs. Useless. I
mean, actual observability into your CI
is unbelievable. And the more I've heard
people moving over to Blacksmith, the
more I've been hearing them say this is
their favorite part. It's so much easier
to figure out why things are failing,
sort through your logs, fix failing
tests, and so much more. If you're like
me and you've been shipping way more
code lately, Blacksmith's going to make
your life significantly easier. It's
free to get started. It's way faster
than GitHub, and it's cheaper overall,
too. It's really hard to go wrong with
Blacksmith. And if you don't believe me,
check them out now at
soyv.link/blacksmith.
Here are the three key things I want to
focus on for this video: a real quick
overview of why it happened. Then we're
going to go into why you should still
use AWS despite this. And then we'll
wrap up with my response to that
Elizabeth Warren post, why I think it's
stupid to break up AWS. So let's start
with why it happened. There's a lot of
varying coverage. I think that this
article from The Register is one of the
better options. The TL;DR is that there
was a DNS issue with DynamoDB that
caused this first layer of failures. US
East has multiple redundant data
centers, but there's an abstraction on
top to route the traffic to those data
centers. In that abstraction layer, that
DNS zone failed. According to Amazon
themselves, AWS experienced increased
error rates on AWS services in US East 1,
which impacted Amazon.com and Amazon
subsidiaries as well as AWS support
operations. This is between 11:49 p.m.
on the 19th and 2:24 a.m. on the 20th. I
was out partying at TwitchCon, which is
really funny because my bed would have
woken me up otherwise. I was sleeping.
It was 2 a.m.-ish and I was alerted about
the issue because my internet mattress,
which I love, shout out, was warm, which
led me to opening the app. The app
for my Eight Sleep was having issues because
of the AWS outage, and I couldn't
change the temperature of my bed.
Totally not a problem I've had before.
Definitely not a thing I've experienced
in the past. By the way, I do have an
affiliate link if you want to sleep
better than you've ever slept. Soyv.leep
link in the description. Sounds insane,
but having a water-cooled bed that changes
temperature throughout the night is
actually really nice. I bought it to
make fun of it and now it's like my
favorite thing in the world. I missed it
dearly during my trip. So that was when
the outage happened late at night. The
cause of this event was a DNS resolution
issue for the regional DynamoDB service
endpoints. It was mitigated by 2:24 a.m.
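To make the failure mode concrete, here's a rough sketch, purely illustrative and not AWS's actual internals, of what it looks like for a client when DNS resolution for a regional endpoint fails, and how a client could fall back to a different region's endpoint. The `dynamodb.<region>.amazonaws.com` naming convention is real; the stubbed resolver and failover logic are my own illustration.

```python
import socket

# Regional DynamoDB endpoints, in preference order. The naming
# convention is real; the failover logic below is illustrative.
ENDPOINTS = [
    "dynamodb.us-east-1.amazonaws.com",
    "dynamodb.us-west-2.amazonaws.com",
]

def resolve(hostname, resolver=socket.gethostbyname):
    """Try to resolve a hostname; return None on DNS failure.
    socket.gaierror (a DNS lookup error) is a subclass of OSError."""
    try:
        return resolver(hostname)
    except OSError:
        return None

def pick_endpoint(endpoints, resolver=socket.gethostbyname):
    """Return the first endpoint whose DNS resolves, else None."""
    for host in endpoints:
        if resolve(host, resolver) is not None:
            return host
    return None

# Simulate the outage: us-east-1 resolution fails, us-west-2 works.
def broken_us_east_resolver(hostname):
    if "us-east-1" in hostname:
        raise OSError("simulated DNS resolution failure")
    return "198.51.100.7"  # placeholder address (TEST-NET-2)

print(pick_endpoint(ENDPOINTS, broken_us_east_resolver))
# -> dynamodb.us-west-2.amazonaws.com
```

Of course, most services hard-pin a single region exactly the way the first entry of that list does, which is why a regional DNS failure took so many of them down.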
After this was resolved, AWS services
began recovering, but a small subset of
internal subsystems continued to be
impaired. They are still investigating
what caused this second layer of things
that meant the outage was continuing. It
took until 3:01 p.m. for all things to
be properly and fully restored. There are a
lot of layers to why that second set of
failures happened. It's probably a
combination of all the work that was put
off while the outage was
occurring: all the Lambdas that were
queued, all of the work in SQS, all the
different things people do on AWS. It
probably thundering-herded itself
to death over and over again to some
extent. But we'll wait till we have a
more thorough update from Amazon to be
confident on that. There are also claims
that this is a result of the brain drain
as more and more quality people from
Amazon and AWS have been leaving and now
the best engineers aren't there and
who's left might not remember how DNS
works. As we know, it's always DNS.
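The thundering-herd problem mentioned above, where queued work all retries against a recovering service at the same instant, has a standard mitigation: exponential backoff with jitter, which AWS's own architecture guidance recommends in general (I'm not claiming this is what failed internally). A minimal sketch of the "full jitter" variant:

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff: a random delay between 0 and
    min(cap, base * 2**attempt). Randomizing the delay spreads retries
    out so thousands of queued clients don't hammer a recovering
    service simultaneously."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling

# The average delay grows with each attempt, but is always within bounds.
for attempt in range(6):
    d = backoff_delay(attempt)
    assert 0.0 <= d <= min(30.0, 0.5 * 2 ** attempt)
```

Without the jitter, every client computes the identical delay and the herd simply stampedes again on a synchronized schedule.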
Yeah, that's the TL;DR of why it
happened. Engineers everywhere
pretending to monitor the situation
while refreshing the AWS status page.
This was so real. This was us for T3
chat, which yes, was down. It's funny
cuz we were not in US East until I moved
to Convex. So technically, this is once
again Convex causing me problems. As
silly as that is, Convex is not the
problem at all. They made the right
choice here. Also, HN trying to get
dunks, with somebody who I have blocked as
the starting point. Very annoying because,
yes, Vercel was down like
everything else there. Yeah, I guess
this is a good transition into step two.
First off, thank you Sam for replying to
this. Well, I love Sam. If you don't
know him, he makes incredible articles
about all sorts of different stuff. He
did the load balancing article that we
did a video about. He did the retries
one and the queuing one. So many great
articles. I highly recommend checking
his stuff out. This is a good post.
You're not down today, but one day you
will be. You work under the same
constraints everyone else does. You know
how stressful it is when it happens to
you. Be better. Yes, betting on someone
else doesn't make the problem go away.
It's yet another bet. Every additional
layer of bets does make things riskier.
So the fact that for me there is AWS as
like layer one. So we have AWS
US East 1, and then for my database I have
Convex as a layer. But even Convex is an
abstraction on top of PlanetScale
Metal. To be truthful,
Metal is very separated from AWS. It is
in the same regions, but they are
renting shelf space and like racks that
they put their own hardware into. So,
they're as much to the side here as they
are like another layer. I know a lot of
people who are on PlanetScale had no
issues despite the fact that AWS was
[ __ ] itself. So, this is my stack.
So, now if any piece here fails, T3 Chat,
the tiny little box on top, is screwed.
And when you add more layers, you do
potentially increase the surface area
for problems to happen. But there are
certain ones like Planet Scale Metal
that tend to increase reliability rather
than decrease it. But that leads us to a
question. Does using the same servers as
Amazon increase or decrease risk? The
fact that people are currently
unironically saying that it is riskier
to use Amazon servers than it is to use
some random company hosting one-off
servers for way too cheap in Germany,
where one power surge could cost you all
your data, is pretty absurd to me. It's
just... no. It makes no sense. If this was
the obvious correct path, you would have
companies like Netflix doing it. Why
would Netflix, a multi-billion dollar
company, do this wrong? If you genuinely
believe it is easier and safer to host
VPSes, why is it only random hobbyists
and people who have a service with 100
users talking about it ever? I've never
seen real production apps talking about
this outside of a little bit of stuff
that's slowly dying from the DHH camp.
Like, I hate to be so direct about this.
It's just kind of silly that none of the
people building real software are
building this way. I've never seen
someone proudly bragging that all of
their stuff is on Hetzner or on some
random modern VPS solution that doesn't
have endless problems or zero users.
Like it's just always the case. Does
this mean you should never learn how to
do that? No, absolutely not. Everyone
should know how to spin up a Linux
server and run code on it if you're
working on servers for a living. But you
should have a general idea of how the
pieces work so that you know what you're
not dealing with when you move to
something like AWS. And no, this isn't
because corporations prefer AWS. That is
[ __ ] delusional. There are companies
that are competing with Amazon that are
still building on top of AWS. They're
doing it because it's the right balance
of capability, price, reliability, and
overall functionality. Like, it's just
duh. I I just I cannot take anyone
seriously who is sitting here saying
that AWS, GCP, and Azure are bad bets
because you could just use a server.
None of these people are building real
software. I'm sorry. Sure, you can sit
here and [ __ ] on me all you want and
say, "Well, Theo, you could put T3 chat
on a VPS, a really big one, but yeah,
maybe." And the moment we have slightly
too many users, we're screwed or we have
to start putting a Kubernetes layer in
front to distribute it. Fun.
By the way, Mammud here made a great
post questioning why someone would
deploy their app to a VPS in 2025 as an
engineer at Railway, which is one of the
few companies that is actually hosting
their own bare metal. They've been moving
everything over to their own servers
recently. They are kind of like modern
Heroku. Great company. I have a lot of
friends there. What's funny with Railway
is that they're the company that should
be saying this because to an extent
they're letting you host VPS's, but they
have, as many have, built a much better
abstraction on top because doing the
exact same management and configuration
everyone else has to do, makes no
[ __ ] sense. None. None at all. And
even a company like Cloudflare, which is
basically essential if you have a CDN
and DDoS protection, which by the way, if
you're hosting on a VPS, you [ __ ]
need DDoS protection. There is no world
in which you're hosting on a VPS and
dealing with insane loads without
something in front to protect it. And if
you have Cloudflare as the layer in
front preventing DDoS attacks, there's a
very good chance that layer is going to
go down at some point because it has in
the past due to the fact that they're
using GCP still for the storage for
things like KV. Yes, really. There are
parts of Cloudflare that are using GCP
still. They're working on their way off
it as far as I know, but it's still
there. So, everyone is vulnerable. If
your service is not vulnerable to
outages like this, not saying you're in
US East 1, but if your service is not
vulnerable to something like this, your
service isn't real or you're just wrong.
Period. Point blank. End of story.
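Since every real service has a dependency chain that can fail, the practical move isn't pretending you're immune, it's degrading gracefully when a link breaks. A toy circuit breaker (my illustration, not anything from any stack mentioned in the video) shows the idea: after repeated failures you stop hammering the broken dependency and serve a fallback instead.

```python
class CircuitBreaker:
    """Toy circuit breaker: after `threshold` consecutive failures,
    stop calling the flaky dependency and return a fallback instead."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, fallback):
        if self.failures >= self.threshold:   # circuit is open
            return fallback
        try:
            result = fn()
            self.failures = 0                 # success resets the count
            return result
        except Exception:
            self.failures += 1
            return fallback

# Simulate a dependency that is down (always raises), and count calls.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    raise RuntimeError("upstream region is down")

breaker = CircuitBreaker(threshold=3)
results = [breaker.call(flaky, fallback="cached response") for _ in range(10)]
print(calls["n"])  # -> 3: after three failures the breaker stops calling it
```

Real implementations add a timeout that half-opens the circuit to probe for recovery, but even this sketch captures the point: the outage hits you either way, and the only question is whether you fail loudly or fall back to something.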
Everyone is vulnerable to something in
their chain failing or they don't have a
real [ __ ] chain. This is not a time
for us to be like, "Well, maybe we
should move to servers." No, that's just
[ __ ] delusion. It's cope. It's people
who don't know what they're talking
about. It's a bunch of web devs hosting
[ __ ] in PHP that pretend they know how
services work. I'm sorry. just not the
real world. That all said, we should
probably stop doing everything in one
region. Sure, AWS does have multiple
data centers within their region, and if
any one fails or has an outage, the
other two are probably fine. But the
problem here isn't that a service
within AWS failed or that a data
center went down. It's that the layer that
routes to those data centers did. If
anything, I would argue that it is
incredibly impressive that there is an
outage that can occur at the DNS level
that only hits and affects one region
with AWS because this region has
multiple data centers. If one data
center fails, things happen. If they all
fail, worse things happen. But if you
have DNS routing between those things
and that fails, but it only fails in
US East 1, that's almost a [ __ ]
miracle. The fact that only one piece of
AWS went down for this is genuinely
impressive. And generally speaking, the
fact that AWS outages like this are so
expensive is more incentive for them to
keep it from happening in the future. So
many companies hide from the fact that
they have outages like this. Amazon
can't. Amazon lost billions upon billions
of dollars because of this outage. Do
you seriously think they're going to
just let it happen again and not
put a shitload of guards in
place to prevent this from happening
again? This outage made the internet
more resilient than it's ever been. And
that's going to continue to be the case.
AWS is going to invest more heavily than
they ever have in resilience for these
things because otherwise people will
actually move and also their own
businesses will lose money because
Amazon was down for a lot of this as
well. It's just it's genuinely silly to
me to think that this is a reason to
move off of AWS. Like what? Cathode Ray
dude is one of my favorite YouTubers. He
did a great video on Teos recently and
how the company's slowly dying. And NMG
just dropped a quote from him here that
I think fits here perfectly. How dare
they ask me to pay money for a
well-designed, purpose-built device that
does a great job at a specific task that
I value highly. What fiends? I'll show
them by spending the same amount of
money and vastly more time and effort
making something that works almost half
as well. Yeah, this is the problem. This
is how I feel whenever somebody talks
about things that have to do with not
using AWS. Like, oh, cool. So, you're
going to go reinvent everything
yourself. Let's see how that goes for
you.
Let me know when you can actually work
on your software again. I have seen
people sharing this and a bunch of memes
in a similar format claiming that
companies like Vercel are in the end
just AWS wrappers. Even Netflix is just
an AWS wrapper. Do you know what C++ is?
I'll give you a hint: it has to be compiled
to assembly. Everything's a [ __ ]
wrapper. I am so tired of this goddamn
argument. The entirety of technology is
wrappers on top of other things. Welcome
to software. There's almost nothing that
is that bottom level. Literally every
single thing we work in and touch that
is worth using is an abstraction.
Companies selling shirts aren't creating
their own fabric. Engineers writing C++
are not writing their own assembly.
Services giving you access to servers
are not going to host their own servers.
Unless, and this is the key, unless
there's a really good specific reason to
do it. When a company is making
infrastructure like Vercel and they
choose to work with AWS, what that tells
me isn't, oh, they're going to
overcharge me for an AWS wrapper. What it
tells me is that the problem space they
choose to solve in is different from the
problem space that AWS is solving in.
And rather than try to reinvent all of
the hard work and billions of dollars that
AWS has spent to make incredible
scalable servers for a reasonable price,
they decided to take advantage of that
existing work and build on top of it.
This actually makes for a really good
transition to point three. Why breaking
up Amazon is stupid. This is the post
that inspired my whole rant. According
to Elizabeth Warren, if a company can
break the entire internet, they are too
big. Period. It's time to break up big
tech. As the community note correctly
states here, AWS is not a monopoly. It
represents 30% of the web. The fact that
30% of the web feels so big is pretty
wild, but it does feel like so much was
down. Is it bad that one company going
down can affect other things? Perhaps.
But that's how supply chains work. When
the Suez Canal was blocked, the entire
world was put on hold. Is it okay for a
company to own something as valuable
as the Suez Canal? If there was only
one, probably not. But there isn't.
There are so many options. There are
arguably too many options for hosting. I
would argue the opposite of this point
specifically. Personally, I think it's
pretty [ __ ] cool that some random
vibe coding kid has access to the exact
same infrastructure as Netflix. That is
the magic of what's happened here
because companies like AWS have to put
so much work in to build the products
that they're building. To build AWS is
such a complex thing to do. They decided
early on, through an Amazon mandate
from Jeff, that anything that was
being used by multiple teams should be
built in a way that it could be
abstracted and sold to external
customers because there was so much work
being redone internally. They realized
at Amazon, oh, we have four teams that
are trying to find a way to store files.
What if we made a generic solution for
storing files and let all of the teams
use it? That innovation they made was
awesome. And if that innovation is a
thing they were forced to keep
internally because they're scared of
being broken up if they sell it, then
the entire ecosystem is going to slow
down because now every single [ __ ]
company has to reinvent file uploading.
Please explain to me in detail why that
would be a good idea. Why we need to
reinvent file uploading again and again.
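The "generic solution for storing files" that Amazon landed on is essentially a key-value object store. A toy in-memory version (purely illustrative; S3's real API adds durability, auth, versioning, lifecycle rules, and a decade of edge cases) shows the shape of the abstraction that every team would otherwise be rebuilding:

```python
class ObjectStore:
    """Toy S3-like object store: buckets map string keys to byte blobs.
    Illustrative only; it captures the interface, not the hard parts
    (replication, durability, access control, scale)."""

    def __init__(self):
        self._buckets = {}

    def put_object(self, bucket, key, body: bytes):
        """Store a blob under bucket/key, overwriting any existing one."""
        self._buckets.setdefault(bucket, {})[key] = body

    def get_object(self, bucket, key) -> bytes:
        """Fetch a blob; raises KeyError if the bucket or key is missing."""
        return self._buckets[bucket][key]

    def list_objects(self, bucket, prefix=""):
        """List keys in a bucket matching a prefix, sorted."""
        return sorted(k for k in self._buckets.get(bucket, {})
                      if k.startswith(prefix))

store = ObjectStore()
store.put_object("avatars", "users/42.png", b"\x89PNG...")
store.put_object("avatars", "users/43.png", b"\x89PNG...")
print(store.list_objects("avatars", prefix="users/"))
# -> ['users/42.png', 'users/43.png']
```

The interface is trivial; making it durable, cheap, and globally scalable is the billions-of-dollars part, which is exactly why selling it to everyone beat every company rebuilding it.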
Trust me, this is not a fight you want
to take with me in particular. Anyways,
it's pretty cool that anybody can use
the exact same infra as the biggest
companies in the [ __ ] world. That I
have the same level of reliability,
scalability, and service functionality
as companies like Amazon
do. Because instead of them taking all
of that work they put into every single
service and refusing to let anyone
benefit from it, they're doing the
opposite. They're buying companies like
Twitch, my old employer, taking their
video infra that was exclusively usable
by Twitch, turning it into a service,
and selling it to Twitch's competition,
like Kick. That's awesome. I think that
is great. This is one of the few times
that you can point at and with almost no
argument against it, say capitalism is
working well. Companies had incentives
to build these things for themselves and
then due to the nature of markets, they
had incentive to sell those things to
others. That's a good thing. The fact
that we all don't have to go reinvent
the concept of storing files is a really
positive win for the entire ecosystem.
It's part of what makes the whole tech
world as awesome and productive as it
is. We can build on top of the hard work
of others. If every single service, if
every single app, if every single
vibe-coded thing on Replit had to be
built using a bunch of things that they
rolled themselves instead of building on
top of these layers, we would all still
be writing [ __ ] assembly. This is all
on top of the fact that it's not a
goddamn monopoly. There are multiple
options. There's all the VPS bros
shedding themselves, but there's also
Google Cloud, which I have my issues
with. There's a reason people don't like it
as much. Azure, which is making almost
as much money as AWS now because they
charge insane licensing fees. We're also
using Azure for a bunch of our inference
right now for T3 Chat. There's also
smaller companies I've been showcasing
like Railway that I think are really
cool, too. There's a surprising number
of businesses relying on one specific
region in AWS because it's useful. It's
a good thing to rely on and the
reliability tends to be pretty solid.
I'm going to go a bit of a different
direction here and talk about this, my
new iPhone. This is the iPhone 17 Pro
Max. It's expensive. I'm not going to
say otherwise, but the fact that pretty
much every iPhone user is on the same
effective device with like a plus or
minus 10 to 20% performance gap is
actually a really cool thing. The fact
that a billionaire doesn't have a way
more expensive version of a phone than
somebody who's making 50k a year is
good. Look at almost every other market
outside of tech. Look at cars. Look at
houses. Look at flights. Look at
everything else in the world. The
version somebody with a median wage uses
is fundamentally different from the
version that a big billionaire class
person uses.
I think it's nice that that's not always
the case and that you can't spend way
more money to get a way better iPhone.
The best iPhone in the world is like
$1,600. You cannot get better than that.
And the only difference is that
there's more storage. Like, this is cool.
I really like the fact that something
like AWS allows for you, me, and
billionaires to have the same exact
goddamn infrastructure. That's awesome.
This is capitalism winning. It's driven
the costs down so much and the quality
of product so high that there's no
reason to get something bigger and
fancier. You can get abstractions around
the best thing right now. You can get
things that make the DX better with AWS.
Things like Versell, they're not paying
me. I just like using them. But we're
all building on top of the same thing
because that one thing has
effectively become a commodity. The same
way that you and I drink the same water
and you and I own the same iPhone, you
and I use the same servers on AWS or on
Google or on Azure even, that's a
[ __ ] win. And the fact that you can
buy a $600 iPhone right now and have the
same quality of experience minus, I
don't know, maybe a slightly worse camera
than on my $1,400 to $1,800 one?
Cool. That's good. If you need more, get
more. Who cares? This is this is a good
thing. And it is genuinely annoying to
me that when any little thing goes
slightly wrong, the response is, "Never
should anyone use this again. We should
be breaking this up. It should be
illegal. It's terrible. It's bad. It's
awful." Not saying AWS doesn't have
problems. I'm certainly not saying Apple
doesn't. I could rant about the state of
the App Store for hours. In fact, I have
in the past. Check out my other videos.
What I'm saying is that it's pretty cool
that the level of entry is the same for
everyone and that we have successfully
made good enough primitives to work at
almost every different scale and that
people who are independents working on
small projects are benefiting from the
same new infrastructure and powerful
things that are being built by AWS as
companies like Netflix are. That's a
good thing. Imagine a world where
Netflix can spend billions of dollars
building infrastructure primitives and
then keeps it to themselves and no one
else is allowed to use those and you
have to build it yourself to compete at
all. The reason that there are these
small projects that are competing with
big companies now, the reason a 100
person team can be a real threat to a
100,000 person company is because AWS
has given access to everyone. All of
this said, the fact that Amazon uses
their profits from AWS to discount
things on Amazon.com to squash
competition, that might be worth talking
about a bit. Conversation for another
day. The death of diapers.com through
Amazon subsidies is a thing that is
actually monopolistic and is worth
talking about, but AWS being a useful
service has nothing to do with that. So,
Warren, please,
for the love of all things American,
don't say this or at the very least talk
to somebody technical that isn't a VPS
bro before saying something stupid like
this in the future. This makes all of us
look bad. In the words of Gergely here,
AWS didn't break the entire internet.
Companies that decided to build
non-resilient systems depending on one
single cloud region, US East 1, broke
themselves. Notice how X, Google, Meta,
Shopify, etc., were all fine. Breaking
up AWS would solve nothing. That's all I
got on this one. Let me know what you
all think.