
The Infinite Software Crisis – Jake Nations, Netflix

By AI Engineer

Summary

Topics Covered

  • We Ship Ununderstood Code
  • Confuse Easy with Simple
  • AI Compounds Accidental Complexity
  • Context Compression Beats Generation
  • Understanding Survives AI Generation

Full Transcript

[music]

Hey everyone, good afternoon. Um, I'm going to start my talk with a bit of a confession. Uh, I've shipped code I didn't quite understand. Generated it, tested it, deployed it. Couldn't explain how it worked. And here's the thing, though: I'm willing to bet every one of you has, too. [applause] So, now that we've all admitted we ship code we don't understand anymore, I want to take a bit of a journey and see how this has come to be. First, look back in history. We

see that history tends to repeat itself. Second, we've fallen into a bit of a trap. We've confused easy with simple. Lastly, there is a fix, but it requires us not to outsource our thinking. So, I spent the last few years at Netflix helping drive adoption of AI tools, and I have to say the acceleration is absolutely real. Backlog items that used to take days now take hours, and large refactors that have been on the books for years are finally being done. Here's the thing, though.

Large production systems always fail in unexpected ways. Like, look what happened with Cloudflare recently. When they do, you'd better understand the code you're debugging. And the problem is, we're now generating code at such speed and such volume that our understanding is having a hard time keeping up. Hell, I know I've done it myself. I've generated a bunch of code, looked at it, and thought, I have no idea what this does. But, you know, the tests pass, it works. So I shipped it. The thing

here is, this isn't really new. Every generation of software engineers has eventually hit a wall where software complexity exceeded their ability to manage it. We're not the first to face a software crisis. We're just the first to face it at this infinite scale of generation. So let's take a step back to see where this all started. In the late '60s and early '70s, a bunch of smart computer scientists came together and said, "Hey, we're in a software crisis. We have this huge

demand for software and yet we're not really able to keep up, projects are taking too long, and it's just really slow. We're not doing a good job." So Dijkstra came up with a really great quote, and to paraphrase a longer one, he said: when we had a few weak computers, programming was a mild problem, and now that we have gigantic computers, programming has become a gigantic problem. He was explaining that as hardware power grew by a factor of a

thousand, society's wants for software grew in proportion, and so it left us, the programmers, to figure out, between the ways and the means, how to support that much more software. So this kind of keeps happening in a cycle. In the '70s we get the C programming language, so we could write bigger systems. In the '80s we have personal computers; now everyone can write software. In the '90s we get object-oriented programming and inheritance hierarchies from hell, and, you know,

thanks, Java, for that. In the 2000s we get agile, and we have sprints and scrum masters telling us what to do. There's no more waterfall. In the 2010s we had cloud, mobile, DevOps, you know, everything. Software truly ate the world. And today, now, we have AI. You know, Copilot, Cursor, Claude Code, Codex, Gemini, you name it. We can generate code as fast as we can describe it. The pattern continues, but the scale has really changed. It's infinite now. So, uh, Fred Brooks, you might know him

from writing The Mythical Man-Month. He also wrote a paper in 1986 called No Silver Bullet. And in it, he argued that there would be no single innovation that would give us an order-of-magnitude improvement in software productivity. Why? Because, he said, the hard part was never the mechanics of coding: the syntax, the typing, the boilerplate. It was about understanding the actual problem and designing the solution. And no tool can eliminate that fundamental difficulty. Every tool and technique

we've created up to this point makes the mechanics easier. The core challenge, though, understanding what to build and how it should work, remains just as hard. So, if the problem isn't in the mechanics, why do we keep optimizing for it? How do experienced engineers end up with code they don't understand? Now, the answer, I think, comes down to two words we tend to confuse: simple and easy. We tend to use them interchangeably, but they really mean completely different things. Uh, I was

outed at the speaker dinner as being a Clojure guy, so this is kind of clear here. But Rich Hickey, the creator of the Clojure programming language, explained this in his 2011 talk called Simple Made Easy. He defined simple as meaning one fold, one braid, no entanglement. Each piece does one thing and doesn't intertwine with others. He defines easy as meaning adjacent: what's within reach, what you can access without effort. Copy, paste, ship. Simple is about structure. Easy is

about proximity. The thing is, we can't make something simple by wishing it. Simplicity requires thought, design, and untangling. But we can always make something easier. You just put it closer. Install a package, generate it with AI, you know, copy a solution off of Stack Overflow. It's human nature to take the easy path. We're wired for it. You know, as I said, copy something from Stack Overflow. It's right there. A framework that handles everything for you with

magic. Install and go. But easy doesn't mean simple. Easy means you can add to your system quickly. Simple means you can understand the work that you've done. Every time we choose easy, we're choosing speed now, complexity later. And honestly, that trade-off really used to work. The complexity accumulated in our codebases slowly enough that we could refactor, rethink, and rebuild when needed. I think AI has destroyed that balance, because it's the ultimate easy button. And

it makes the easy path so frictionless that we don't even consider the simple one anymore. Why think about architecture when code appears instantly? So let me show you how this happens, how a simple task evolves into a mess of complexity through a conversational interface that we've all come to love. You know, this is a contrived example, but say we have our app and we want to add, uh, some authentication to it. We say add auth. So we get a nice clean auth.js file.
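(For illustration only, here's roughly what that clean first-turn auth file might look like, written as a TypeScript sketch with hypothetical names; it's not code from the talk.)

```typescript
// auth.ts: a hypothetical first-turn result. One concern, nothing intertwined.
// Illustrative sketch only; assumes a Node 19+ or browser runtime so that the
// global crypto.randomUUID() is available.

export interface Session {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

const sessions = new Map<string, Session>(); // in-memory store, fine for a sketch

export function login(userId: string, ttlMs = 60 * 60 * 1000): string {
  const token = crypto.randomUUID();
  sessions.set(token, { userId, expiresAt: Date.now() + ttlMs });
  return token;
}

export function authenticate(token: string | undefined): Session | null {
  if (!token) return null;
  const session = sessions.get(token);
  if (!session || session.expiresAt < Date.now()) return null;
  return session;
}
```

At this point the code is both easy and simple: one concern, nothing intertwined. The trouble described next starts when each later request quietly reshapes it.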

Iterate on it a few times and it gets messier. You're like, okay, cool, we're going to add OAuth now too, and now we've got an auth.js and an oauth.js. We keep iterating, and then we find that sessions are broken and we've got a bunch of conflicts, and by the time you get to turn 20, you're not really having a discussion anymore. You're managing context that has become so complex that even you don't remember all the constraints that you've added to it. Dead code from

abandoned approaches. Uh, tests that got fixed by just making them work. You know, fragments of three different solutions, because you kept saying, "wait, actually..." Each new instruction is overwriting architectural patterns. We said make the auth work here. It did. We said fix this error. It did. There's no resistance to bad architectural decisions. The code just morphs to satisfy your latest request. Each interaction is choosing easy over simple. And easy always means more

complexity. We know better. But when the easy path is just this easy, we take it. And complexity is going to compound until it's too late. AI really takes easy to its logical extreme. Decide what you want, get code instantly. But here's the danger in that: the generated code treats every pattern in your codebase the same. You know, when an agent analyzes your codebase, every line becomes a pattern to preserve. The authentication check on line 47, that's a pattern. That weird

gRPC code that's acting like GraphQL that I may have written in 2019, that's also a pattern. Technical debt doesn't register as debt. It's just more code. The real problem here is complexity. I know I've been saying that word a bunch in this talk without really defining it, but the best way to think about it is that it's the opposite of simplicity. It just means intertwined. And when things are complex, everything touches everything else. You can't change one thing without affecting 10 others.
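(To make "intertwined" concrete, here's a hedged sketch, not Netflix code, of the kind of function where an auth check, a business rule, and an old workaround are braided together, so nothing can change in isolation. All names are hypothetical.)

```typescript
// Hypothetical example of tangled (accidental) complexity: an auth check, a
// business rule, and a legacy workaround braided into one function. Changing
// any one of them risks breaking the other two.
interface User  { id: string; roles: string[] }
interface Order { id: string; ownerId: string; total: number; legacyRegion?: string }

export function fulfillOrder(user: User, order: Order): string {
  // Auth woven into business logic: the role check also gates pricing below.
  if (!user.roles.includes("fulfillment") && user.id !== order.ownerId) {
    throw new Error("forbidden");
  }

  // Business rule that silently depends on the auth branch above.
  let total = order.total;
  if (user.roles.includes("fulfillment")) total *= 0.9;

  // A 2019-era workaround that now looks like a load-bearing "pattern".
  if (order.legacyRegion === "eu-legacy") total = Math.round(total * 100) / 100;

  return `fulfilled ${order.id} for ${total}`;
}
```

The simple version would keep the permission check, the pricing rule, and the legacy rounding apart so they don't know about each other; that untangling is exactly the thinking the easy path skips.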

So, back to Fred Brooks's No Silver Bullet paper. In it, he identified two main types of complexity in every system. First, there's essential complexity, which is the fundamental difficulty of the actual problem you're trying to solve. Users need to pay for things, orders must be fulfilled. This is the complexity of why your software system exists in the first place. And second, there's accidental complexity: everything else we've added along the

way. Workarounds, defensive code, frameworks, abstractions that made sense a while ago; it's all the stuff we put together to make the code itself work. In a real codebase, these two types of complexity are everywhere, and they get so tangled together that separating them requires context, history, and experience. The generated output makes no such distinction, and so every pattern just keeps getting preserved. So here's a real example from, uh, some work we're doing at Netflix. I have a

system that has an abstraction layer sitting between our old authorization code, which we wrote, say, five or so years ago, and a new centralized auth system. We didn't have time to rebuild our whole app, so we just kind of put a shim in between. So now we have AI. This is a great opportunity to refactor our code to use the new system directly. Seems like a simple request, right? And no. The old code was just so tightly coupled to its authorization patterns. Like, we had permission checks

woven through business logic, role assumptions baked into data models, and auth calls scattered across hundreds of files. The agent would start refactoring, get a few files in, hit a dependency it couldn't untangle, and just spiral out of control and give up. Or worse, it would try to preserve some existing logic from the old system by recreating it using the new system, which I think is not great either. The thing is, it couldn't see the seams. It couldn't identify where the business

logic ended and the auth logic began. Everything was so tangled together that even with perfect information, the AI couldn't find a clean path through. When your accidental complexity gets this tangled, AI is not the best help to actually make it any better. I found it only adds more layers on top. We can tell the difference, or at least we can when we slow down enough to think. We know which patterns are essential and which are just how someone solved it a few years ago. We carry the

context that the AI can't infer, but only if we take the time to make these distinctions before we start. So how do you actually do it? How do you separate the accidental and essential complexity when you're staring at a huge codebase? The codebase I work on at Netflix has around a million lines of Java, and the main service in it is about 5 million tokens, last time I checked. No context window I have access to, uh, can hold it. So when I wanted to work with it, I

first thought, hey, maybe I could just copy large swaths of this codebase into the context and see if the patterns emerged, see if it would just be able to figure out what's happening. And just like the authorization refactor from before, [clears throat] the output just got lost in its own complexity. So with this, I was forced to do something different. I had to select what to include. Design docs, architecture diagrams, key interfaces, you name it, and take time

writing out the requirements of how components should interact and what patterns to follow. See, I was writing a spec. Uh, 5 million tokens became 2,000 words of specification. And then, to take it even further, take that spec and create an exact set of steps of code to execute. No vague instructions, just a precise sequence of operations. I found this produced much cleaner and more focused code that I could understand, because I defined it first and planned its execution.

This became the approach I called context compression a while ago. But you can call it context engineering or spec-driven development, whatever you want. The name doesn't matter. What matters here is that thinking and planning become the majority of the work. So let me walk you through how this works in practice. Phase one: research. You know, I go and feed everything to it up front. Architecture diagrams, documentation, Slack threads.

I've been over this a bunch, but really, just bring as much context as you can that's going to be relevant to the changes you're making. And then use the agent to analyze the codebase and map out the components and dependencies. This shouldn't be a one-shot process. I like to probe: say, what about the caching? How does this handle failures? And when its analysis is wrong, I'll correct it. And if it's missing context, I provide it. Each iteration refines its analysis.
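(The output of this loop, described next, is a single research document. Purely as an illustration, assuming a structure the talk never specifies, you could picture it as something like this TypeScript sketch, with hypothetical component names reused from the earlier example.)

```typescript
// Hypothetical shape for the phase-one research document: what exists,
// what connects to what, and what the proposed change will affect.
interface ResearchDoc {
  change: string;                                             // the change being researched
  components: { name: string; role: string }[];               // what exists
  dependencies: { from: string; to: string; why: string }[];  // what connects to what
  affected: string[];                                         // blast radius of the change
  openQuestions: string[];                                    // gaps to resolve before planning
}

// Example instance, loosely based on the authorization refactor described earlier.
export const research: ResearchDoc = {
  change: "Route authorization through the new centralized auth system",
  components: [
    { name: "auth-shim", role: "adapter between legacy checks and centralized auth" },
    { name: "order-service", role: "business logic with inline permission checks" },
  ],
  dependencies: [
    { from: "order-service", to: "auth-shim", why: "role assumptions baked into data models" },
  ],
  affected: ["order-service", "auth-shim", "session handling"],
  openQuestions: ["Which entities store their auth-related fields encrypted?"],
};
```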

The output here is a single research document. Here's what exists, here's what connects to what, and here's what your change will affect. Hours of exploration are compressed into minutes of reading. [snorts] I know Dex mentioned it this morning, but the human checkpoint here is critical. This is where you validate the analysis against reality. It's the highest-leverage moment in the entire process. Catch errors here, prevent disasters later. On to phase two. Now that you have some

valid research in hand, we create a detailed implementation plan. Real code structure, function signatures, type definitions, data flow. You want this to be detailed enough that any developer can follow it. I kind of liken it to paint by numbers. You should be able to hand it to your most junior engineer and say, "Go do this," and if they copy it line by line, it should just work. This step is where we make a lot of the important architectural decisions. You know, make sure complex logic is

correct. Make sure business requirements are, you know, following good practice. Make sure there are good service boundaries, clean separation, and no unnecessary coupling. We spot the problems before they happen because we've lived through them. AI doesn't have that option. It treats every pattern as a requirement. The real magic in this step is the review speed. We can validate this plan in minutes and know exactly what's going to be built. And in order to keep up

with the speed at which we want to generate code, we need to be able to comprehend what we're doing just as fast. Lastly, we have implementation. And now that we have a clear plan backed by clear research, this phase should be pretty simple. And that's the point. You know, when the AI has a clear specification to follow, the context remains clean and focused. We've prevented the complexity spiral of long conversations. And instead of 50 messages of evolutionary code, we have

three focused outputs, each validated before proceeding. No abandoned approaches, no conflicting patterns, no "wait, actually" moments that leave dead code everywhere. To me, the real payoff of this is that you can use a background agent to do a lot of this work, because you've done all the thinking and hard work ahead of time. It can just start the implementation. You can go work on something else and come back to review, and you can review quickly because

you're just verifying it's conforming to your plan, not trying to understand if anything got invented. The thing here is, we're not using AI to think for us. We're using it to accelerate the mechanical parts while maintaining our ability to understand it. Research is faster, planning is more thorough, and the implementation is cleaner. The thinking, the synthesis, and the judgment, though, that remains with us. So remember that, uh, authorization refactor I said AI couldn't handle?
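(To make the phase-two "paint by numbers" plan concrete, here's a hedged TypeScript sketch of the level of detail meant: real structure, function signatures, type definitions, and data-flow notes, with hypothetical names rather than the speaker's actual plan.)

```typescript
// Hypothetical excerpt of a phase-two "paint by numbers" plan, written as the
// code structure the change must end up with. All names are illustrative.

// Step 1: the boundary type every call site will use.
export interface AuthzDecision {
  allowed: boolean;
  reason?: string;
}

// Step 2: the single entry point to the centralized auth service. The plan
// fixes this signature up front; implementation only fills in the body.
export async function checkAccess(
  userId: string,
  resource: string,
  action: "read" | "write" | "fulfill",
): Promise<AuthzDecision> {
  // Plan note: call the central service here; nothing else in the codebase may.
  void userId; void resource; void action;
  return { allowed: false, reason: "not implemented yet (plan step 2)" };
}

// Step 3 (data flow): fulfillOrder() awaits checkAccess(user.id, order.id,
// "fulfill"), throws when the decision is not allowed, and leaves pricing and
// the legacy rounding untouched; they are out of scope for this migration.

// Step 4: remove the auth-shim role lookups only after every call site
// compiles against checkAccess; session handling is not modified here.
```

The particular shape doesn't matter; what matters is that the signatures, boundaries, and out-of-scope decisions are written down before any code is generated, which is what makes the review fast.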

The thing is, we're actually working on it now and starting to make some good progress. And it's not because we found better prompts. We found we couldn't even jump into doing any sort of research, planning, or implementation. We actually had to go make this change ourselves, by hand. No AI, just reading the code, understanding dependencies, and making changes to see what broke. That manual migration was, I'll be honest, a pain, but it was crucial. It revealed

all the hidden constraints: which invariants had to hold true, and which services would break if the auth changed. Things no amount of code analysis would have surfaced for us. And then we fed the pull request from that actual manual migration into our research process and had it use that as the seed for any research going forward. The AI could then see what a clean migration looks like. The thing is, each of these entities is slightly different. So we had to go and

interrogate it and say, hey, what do we do about this? Some things are encrypted, some things are not. We had to provide that extra context each time, uh, through a bunch of iteration. Then, and only then, could we generate a plan that might work in one shot. And might is the key word here: we're still validating, still adjusting, and still discovering edge cases.

The three-phase approach is not magic. It only works because we did this one migration by hand. We had to earn the understanding before we could encode it into our process. I still think there's no silver bullet. It's not better prompts, better models, or even writing better specs. It's just the work of understanding your system deeply enough that you can make changes to it safely.

So why go through with all this? Like, why not just iterate with AI until it works? Like, eventually won't models get strong enough that it just works? The thing to me is, "it works" isn't enough. There's a difference between code that passes tests and code that survives in production, between systems that function today and systems that can be changed by someone else in the future. The real problem here is a knowledge gap. When AI can generate thousands of lines of code in seconds,

understanding it could take you hours, maybe days if it's complex. Who knows, maybe never if it's really that tangled. And here's something that I don't think many people are even talking about at this point: every time we skip thinking to keep up with generation speed, we're not just adding code that we don't understand. We're losing our ability to recognize problems. That instinct that says, "Hey, this is getting complex," it atrophies when you don't understand your own system. [snorts]

Pattern recognition comes from experience. When I spot a dangerous architecture, it's because I'm the one who was up at 3:00 in the morning dealing with it. When I push for simpler solutions, it's because I've had to maintain the alternative from someone else. AI generates what you ask it for. It doesn't encode lessons from past failures. The three-phase approach bridges this gap. It compresses understanding into artifacts we can review at the speed of generation. Without it, we're just

accumulating complexity faster than we can comprehend it. AI changes everything about how we write code. But honestly, I don't think it changes anything about why software itself fails. Every generation has faced its own software crisis. Dijkstra's generation faced it by creating the discipline of software engineering. And now we face ours with infinite code generation. I don't think the solution is another tool or methodology. It's remembering what we've always known: that software

is a human endeavor. The hard part was never typing the code. It was knowing what to type in the first place. The developers who thrive won't just be the ones who generate the most code; they'll be the ones who understand what they're building, who can still see the seams, who can recognize when they're solving the wrong problem. That's still us. That will only be us. I want to leave on a question, and I don't think the question is whether or not we will use AI. That's a foregone

conclusion. The ship has already sailed. To me, the question is going to be whether we will still understand our own systems when AI is writing most of our code. Thank you. [applause] [music]
