Nodes Aren't the Future of AI Creation. Here's What Is.
By Bilawal Sidhu
Summary
## Key takeaways

- **Nodes give the illusion of control**: You're still hitting the generate button and waiting for the model to come back with something you hope is close to what you want. For creative tasks, nodes get in the way. [00:21], [00:39]
- **We need 3D viewport control**: We're largely flying blind without the equivalent of a viewport, a 3D representation that lets us see exactly what we had in mind before hitting render. In 3D applications, the viewport is the visual anchor, the source of truth that gives you an interactive window into the world you're creating. [00:35], [00:59]
- **Nodes are assembly language**: Nodes are like the assembly language of AI creation; we need a higher-order abstraction that feels more like creating on a sound stage. Nodes are an effective way to connect different models and operations together, but that's plumbing, not filmmaking. [01:39], [02:02]
- **Unreal Engine as the model**: Inside Unreal Engine, you've got node graphs when you need them, the timeline editor when you need it, and the viewport to understand exactly what you're doing and manipulating. The spatial solution has all these tools working together. [04:29], [04:56]
- **A 3D scene graph for AI**: A 3D scene graph is nodes for the AI itself, a representation that tells the AI system everything in the environment, the entities within it, their characteristics, and their interactivity. It gives you the language to intuitively author in viewport and timeline mode while giving the AI something it can understand. [08:34], [09:03]
- **Nodes are a bridge, not a destination**: They're a good side dish, but they can't be the main course. We need the viewport and the spatial 3D engine to make creation feel more human. [10:42], [10:51]
Topics Covered
- Nodes Illusion of Control
- Nodes Are Assembly Language
- Need Real-Time Viewport
- 3D Scene Graph Unlocks Continuity
- Nodes Bridge Not Destination
Full Transcript
If you've used any generative media tool lately, you've probably noticed a trend: nodes everywhere the eye can see. Whether it's Krea, Weavy, the OG ComfyUI, freaking Runway, heck, even Adobe is leaning into nodes. And quite frankly, I'm a little worried, because to me, nodes aren't the future. In fact, I would posit that they're getting in the way. They give you the illusion of control, but you're still hitting the generate button, waiting for the model to come back with something you hope is close to what you want. I would posit instead that what we need is 3D control. Here's why.
Look, in my opinion, when we're doing generative AI right now, we are largely flying blind. We don't have the equivalent of a viewport, that 3D representation that lets us see exactly what we had in mind before we actually hit the render button. It really does feel like typing in some prompts and maybe a couple of image references and hoping for the best. Now, if you've used any 3D application, you know exactly what I'm talking about. The viewport is the visual anchor, the source of truth that gives you an interactive window into the world you're creating. And I would posit that in the case of AI, that's exactly what we need, except we're co-creating this world. And it's really funny to me, because a lot of people say, "Oh my god, but 3D is really hard. I don't want to have to manipulate the camera angle and where all the people are and the lighting sources and all this other stuff." And instead they're dealing with this crazy node graph that is so hard to grok. Look, I grew up on After Effects and Nuke. I appreciate when you need nodes, but honestly, most of the time, especially for creative tasks, they kind of get in the way. Now look, this of course makes sense, because what we're doing is plumbing. But let's not call it filmmaking or content creation. Nodes are a very effective way to take a bunch of different models and a bunch of different operations and connect them together. Yes, they're great for that, but think about how far that is from the language and art of actually being on a set and recording something with your friends. It's completely night and day. And that's sort of my point: nodes are like the assembly language of AI creation, but we need a higher-order abstraction that feels a lot more like it does to create on a sound stage.
Now look, nodes of course aren't new. As I mentioned earlier, they have been the backbone of all sorts of powerful tools: Houdini, famously; Nuke is another great example; Blender as well. And the whole idea was: how do you create a visual interface for people who aren't savvy with programming, so they get the right knobs and can control the flow of data, the sequence and order of operations, in a visual fashion, without having to resort to writing lines of code? But the thing is, we're missing half of the equation. Sure, you can have this node graph representation. Heck, you can maintain a very detailed 3D scene graph that you can manipulate. But you don't have the viewport right now. The "viewport" right now is basically: hit that slot machine, baby, and hope you get a nice video generation, or incrementally build out an image and then post-process it, and on and on we go. In a sense, what we're lacking is real-time preview. And what I'm excited about is that we've got a whole new crop of real-time video models popping up that will actually make this a possibility.
Now, before I get ahead of myself, let's take a quick look at the three approaches to creating content right now and weigh their strengths and weaknesses. First, obviously, you've got the stack. This is the layer-based editing approach that we all know and love. If you've used Adobe Photoshop or After Effects, you know this sandwich logic is very easy to understand: the order of operations of compositing starts at the bottom of the stack, and you're just sandwiching things on top of it. You can easily move things above or below an object just by changing where it sits in the stack. Of course, as you probably also know, if you're in After Effects and you're 50 pre-comps deep into a composite, those layers can quickly become your enemy. This is where the node graph approach is super helpful. Essentially, you've got an approach for plumbing all your data through all sorts of operations that you can compose together to do very sophisticated image manipulation and data processing. But as you've probably noticed, you still need the timeline view. Layers work really well in concert with the timeline, especially when you're doing things like keyframing, and that gets a little more challenging once you get into node land. The point being: the flow itself is not enough. You need that timeline view. So I would posit that the closest thing we actually have is number three, the spatial solution, à la Unreal Engine. Inside Unreal Engine, you can do all of these things. You've got node graphs when you need them. You've got the timeline editor when you need it. Heck, you can even switch to a Sequencer view, where you're basically moving around your shot list, when you need that. And of course, you've got the viewport to understand exactly what you're doing and manipulating.
So this is the problem: right now we're in a space where a new Lego block drops every freaking two weeks, and the main problem everyone's trying to solve is how to put all these pieces together. As a consequence, everyone's defaulted to a node graph. There's very little innovation actually happening on temporal, timeline-based events; the best you can do is sort of extend stuff, and that's not super helpful. It's very limited. So the solution, to me, really does feel like a 3D application, one that takes advantage especially of all these new world models that are popping up to populate the canvas. You create your character once, because you know that's the character that's going to be in all of your generations. You create your environment once, and you can image it from any direction you want. And then you can frame your shots exactly as you like, rather than typing in esoteric prompt incantations hoping the camera angle comes close to what you had in mind, or drawing weird scribbles just to nudge things in the right direction. It gets really weird. I think that is going to be super freeing. Once we put the Lego bricks together, what other views do we get on exactly the composition we're trying to build? I think that will take us from this cage of nodes to real agency, where you define the composition and the motion intuitively, in a fashion you know really well. In fact, I really want to be able to take my phone, have its motion tracked, have the camera feed go into the 3D application, and use that to frame my shots. Or use something like a 3D mouse, if you're familiar with what that is.
But I think the fact is, the reason companies aren't doing this is because it's hard. On one hand you've got the node-graph-style approach. The other way is templates. You see companies like Higgsfield doing this, basically saying: let me take a bunch of these complicated ComfyUI graphs and wrap them in a very simple-to-use template. But that's only really good for making memes or really short-form content. The moment you start talking about doing something longer than three to five minutes, it gets really exhausting. It is perhaps unsurprising, then, that most of the professionally generated content you're seeing is a bunch of these chaotic one-to-two-minute advertisements. That's all you really see. And there are just a handful of creators pushing this boulder up a hill to make multi-minute cinematic content. Hats off to them. But to bring everyone else into the mix, I think you need direct manipulation and real-time feedback.
And as I alluded to earlier, we finally have video models that are capable of doing exactly this, where it's not just you using something like Genie to WASD around and move the camera with your keyboard; you could start pumping in the AR camera pose from your phone's camera itself, like I mentioned. And of course, you've got other models, like MotionStream, that let you manipulate objects in the generation directly as well, just like you would in a 3D engine. But here's the problem: these systems don't have spatial memory. It's you, the human, who is managing the spatial context for them. In other words, you might go into World Labs and capture the environment from a couple of different angles, and every time you do a generation, you might provide that image reference as an ingredient. I would posit that the system should do that for you.
And you're seeing that World Labs is actually investing in a similar direction, where if you're using a model like Genie and moving around and you see an image, it records the three-dimensional pose of each of those images to give it spatial memory. But this is about more than just spatial memory of what you saw in an environment. We're talking about building an entire experience, which is why I think it is very beneficial to have an explicit 3D scene graph. And all these visual language models are getting better and better at doing amazing things in tools like Three.js. You must have seen some of my previous experiments building fun things; in fact, I built two games that you can go play on YouTube that were entirely vibe-coded. This tech already exists. It just hasn't been married perfectly with these amazing video diffusion and autoregressive models that are capable of doing amazing things on the visual end. So the beauty of a 3D scene graph is that it's sort of like nodes for the AI itself. It's a representation that tells the AI system everything that's in the environment: the entities within it, the characteristics of those entities, and the interactivity attached to any of them. It gives you the language to intuitively author in viewport and timeline mode, but gives the AI something it can understand. And isn't it ironic that an AI system is better at understanding nodes than we are? Now look, I'm excited to see that there are companies trying to tackle this problem with a 3D-first approach. A great example is Intangible AI, which basically lets you use text prompts to flesh out a scene in 3D, so you know exactly what you want. You can manipulate everything with full 3D-level control, and then you just use generative AI to take it all the way. Another tool like this, called Artcraft, lets you do exactly what I just described: you flesh out your environment, you've got a character within it, and then you can go about prompting it. That's especially powerful when you're doing multi-shot continuity. Say the protagonist needs a security-camera CCTV view: very easy to do. Or needs to hide behind some artifact over here: you can scope that out so easily in this 3D world, rather than having to prompt or scribble and do weird things. Want another angle, and then the jump-scare reveal? Those are all things that are so much easier to plan out in 3D. That is fundamentally what I think is going to unlock multi-minute creation.
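To make the scene graph idea concrete, here's a minimal sketch of what "nodes for the AI itself" could look like as data an LLM can absorb. The schema, field names, and the `describeScene` helper are my own illustrative assumptions, not the format of any shipping tool:

```typescript
// A hypothetical scene graph: entities, their characteristics, and interactivity,
// kept as structured data the AI can read while you author in the viewport.
interface Entity {
  id: string;
  kind: "character" | "prop" | "camera" | "light";
  transform: { position: [number, number, number]; rotationY: number };
  traits?: Record<string, string>;   // e.g. { role: "protagonist" }
  interactions?: string[];           // e.g. ["hideBehind", "pickUp"]
}

interface SceneGraph {
  environment: string;
  entities: Entity[];
}

// Flatten the graph into compact text an LLM can reason over between shots.
function describeScene(scene: SceneGraph): string {
  const lines = scene.entities.map(
    (e) =>
      `${e.kind} "${e.id}" at (${e.transform.position.join(", ")})` +
      (e.interactions?.length ? ` can: ${e.interactions.join(", ")}` : "")
  );
  return `Environment: ${scene.environment}\n` + lines.join("\n");
}

const scene: SceneGraph = {
  environment: "abandoned warehouse",
  entities: [
    {
      id: "hero",
      kind: "character",
      transform: { position: [0, 0, 2], rotationY: 180 },
      traits: { role: "protagonist" },
      interactions: ["hideBehind"],
    },
    {
      id: "cctvCam",
      kind: "camera",
      transform: { position: [4, 3, -1], rotationY: 225 },
    },
  ],
};

console.log(describeScene(scene));
```

The point is less the exact schema and more the division of labor: you author entities through the viewport and timeline, while the AI reads and updates this structured representation, which is what gives it spatial memory and continuity across shots like the CCTV angle or the jump-scare reveal.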
And so I think this hybrid approach of 3D combined with generative AI, where you've got the timeline view when you need it and the nodes when you need them, but critically you've also got the node representation of the 3D environment that the LLM can absorb, reason about, and help you manipulate, is how you can finally start unlocking multi-minute creations. We need that spatial AI engine. And it's kind of funny: if you've seen my previous videos about world models and physical AI, this is exactly what robotics needs too, except the constraint for robotics training data is to make it look super photorealistic and true to life. What we're talking about lets you take creative liberties, because photorealism and cinematic realism are very different things.
So that's my take. Maybe I was a little spicy at the start, but you kind of have to be for these types of videos. Hopefully the points I'm bringing up resonate with you. Nodes are essentially a bridge to the future. They are not a destination. They're a good side dish, a good companion, but they can't be the main course. We need the viewport. We need the spatial 3D engine, because ultimately we want creation to feel more human, not less. Nodes are a representation, just like code, and most coders today are just using Claude Code and not even looking at the code. Why are we having creators fiddle around with all these node graphs? It makes no sense to me. And once we start embracing spatial interfaces, not only are we going to make better tools, we're creating tools that work the way our minds do. You're replicating the experience of being on a virtual production sound stage with your friends, creating a piece of content, except your budget is now unlimited and you can do it all inside the computer.
All right, so that's it for this video. Let me know in the comments below: what do you think about nodes? What do you think about timelines? What do you think about this spatial AI engine and this hybrid approach? What do you think is holding back multi-minute generations? And I'll see y'all in the next one.