
LLMs Are Databases - So Query Them

By Chris Hay

Summary

Topics Covered

  • Transformers Have a Syntax-Knowledge-Output Architecture
  • Polysemanticity Is a Dimensionality Constraint
  • Attention Routes Through the Knowledge Graph
  • LLMs Are Literally Queryable Graph Databases
  • You Can Insert Facts Without Retraining

Full Transcript

What if I said every single large language model was a database? Specifically, a graph database. Not metaphorically, but physically: if we looked underneath the hood at the weights, we'd find an actual database with real edges, real entities, real nodes. That's how it is physically represented. What would that mean? Well, it would mean we could query it. We could insert new knowledge. We could compile back down to weights, because it's just a database. And if it's just a database, then we could have query languages like SQL over the top of it. We could programmatically work with the model, if that were true. The good news is, it is true. So, let's do it.

So, we're going to query it with a language called Larql, which allows us to query large language models. I'm going to connect it up to the weights of the Gemma 3 4B model. Believe me, I will connect it up to Gemma 4, but that will be a future video. I've now connected up here, and you can see that the model has 34 layers and roughly 348,000 features. In fact, if I run stats against the model, you see Google Gemma 3, you see the number of features, and then, in the knowledge graph section, you can see we have mapped 1,785 features. What I've done is run a probe over the top of the model and mapped its internal representation onto representations we know. You can also see there are three layer bands: a syntax layer band, a knowledge layer band, and an output layer band. This is the representation of the Google Gemma model. The early layers are really about understanding the question, what the syntax of the query is; the middle layers are about getting the knowledge out of the weights; and the final layers reformat that for output.

So, let's see what the model knows. I'm going to use a command called describe, which lets us describe some of the entities. I'll just type in describe France, and we will see what the model knows about France. If we look at the output, you can see the syntax layers figuring out the question. Look at L5, layer five: you see the word Spanish, and then at layer eight, international. So you can see it knows this is some sort of country-type query. It hasn't quite figured out what it means; it's really working out the syntax as opposed to the knowledge. Then, as I said, L14 to L27 is where the knowledge lives. You can see France there at L23, Europe and Italy tagged as nationality, country tagged as borders. Then you've got Spain at L18, but also Australia at L25, CEO, fountain. This is polysemantic noise sharing the same slots. The facts are there, but they're buried alongside unrelated concepts. And then in the output layers, L28 to L33, French is being committed to as an answer; it's dominating. But you also see German, European, Western, all plausible next tokens depending on the context. The syntax gathers the context, the edges store the knowledge, and the output commits to the token. That's the three-stage architecture of every transformer.

But if we wanted to, we could pull on one of those relations. So, I'm going to do a select star from edges. As I said, this is a graph database, and I'm going to say where entity, and the entity in this case is France. We're going to look at a specific relation, in this case borders. And we'll also put in a score filter, just to cut off any noise and look at the most important results. If we look at what comes back, we can see L25, feature 5067, for the token country. You can also see variations of the spelling of country, three capitalization variants of the same token. This is a real fact: France borders other countries, and it's all stored in one feature at one layer.
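To make the semantics of that query concrete, here's a minimal Python sketch of what a select from edges with an entity, relation, and score filter amounts to. The rows and scores below are illustrative stand-ins I've made up, not real probe output from Gemma.

```python
# Minimal sketch of the edges query: filter probe-labelled rows by entity,
# relation, and score. The rows below are illustrative, not real probe output.
edges = [
    {"layer": 25, "feature": 5067, "entity": "France", "relation": "borders",
     "token": "country", "score": 0.81},
    {"layer": 18, "feature": 9102, "entity": "France", "relation": "borders",
     "token": "Spain", "score": 0.44},
    {"layer": 19, "feature": 4924, "entity": "France", "relation": "nationality",
     "token": "Germany", "score": 0.57},
]

rows = [e for e in edges
        if e["entity"] == "France"
        and e["relation"] == "borders"
        and e["score"] >= 0.5]

for r in rows:
    print(r["layer"], r["feature"], r["token"], r["score"])
```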

And if we want to, we can drill into some of the other relationships. This time we'll drill into nationality. Let's hit clear for a second, and we're going to say select star from edges where entity, again, is equal to France, and this time relation is equal to nationality. And again, I will limit it down to five rows. Let me make this clear: I am querying the real model weights here. If I look at the result, look at L19, feature 4924. We see Germany, we see Sweden, we see Italy; we can see this sort of nationality relationship there. Again, in that one feature, it's storing all of the different nationalities that occur there, and again, that's polysemantic. Down here we see the same thing at layer 20, feature 8164: we see Greece, Hungary, Thailand. So you can see that happening across the layers. This is how the model groups countries by the nationality relation. So you're starting to get an idea of what's going on here: we have entities, we have features, and we have relations connecting them together.

And we can also look at other entities in the model. If we do a describe Einstein, for example, let's see what that looks like. Again, you see the same three-stage structure we had before: syntax, edges, and output. But notice how different it is. The important ones there: you see physics coming through, you see the award relation, and then you see Nobel Prize. That's pretty interesting. And again, you start to see physics on the output, quantum, scientific, all things that are related. Now, one of the things we also support here is a verbose mode for describe. So, I'm going to look at Einstein again. Here you see the labels that the probe picked up, for example award for Nobel. But if we look in here, you can also see, in the also column, other things associated with that relationship. You can see academy, you can see Nobel, you can see Nobel with a different spelling, and you see astronomy, also NASA, etc. Then physics, gravity, quantum; you see different things associated with quantum. So you get a richer idea of what the model is associating with the entity.

Now, coming back to the France example for a second, I can also do queries like select star from edges, and this time ask for things that are nearest to France. Again, I can specify a layer, so I'll say at layer 26, and then we will limit it down to 10 rows. Now, if we look at that, feature 9348: the token is Australia, and then you also see Italy, Germany, Spain. So you get an idea of a country cluster. French is another interesting one; it has different spellings of French there. We see euro-related tokens clustering together: the euro symbol, EU, euros. I think that's pretty cool. And then if we look at the channel tokens, it starts to pick up geographic neighbors as well.

So, let's go a little bit raw here. That was a pretty clean view; let's look at the feedforward network underneath the hood. Again, everything we're looking at is the FFN, which is really the knowledge store where all of the data is stored. Attention is slightly different; we'll talk about that later. If I do something like show features, and again limit it to layer 26, feature 2 is really interesting. It's got this concept of five: you see five, five, and so on, all the various representations of the number five. If we look at F9, it's runners, runners, runner, runner, runners, every capitalization variant, so it's a sort of morphological cluster. F11 is really about discipline: you see various spellings, but also concepts, so the misspellings at the top, but then discipline, discipline, self, etc. Lots of variations in there. And if I want to, I can also query it using a select statement, and you see loads of features coming back. If you want to spend some time running select queries, you can go and have a look at every single feature that exists.

So, we've got what a feature is, but how does that relate to how the model stores it within the weights? Basically, a feature is a single column in the FFN: one gate vector that decides when it fires, and one down vector that decides what it outputs. Gate times down; that is basically the edge. So it's an edge in the graph, and the gate is a direction in the residual stream. I'm not going to go into too much detail about what a residual stream is; I've covered that in my other videos. But when the model's internal state at that layer has a high cosine similarity with the gate direction, the feature activates, and the down vector then adds its contribution to the output, pushing the next-token prediction towards that specific answer.
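To make that concrete, here's roughly what one feature is doing, sketched in numpy. This is a hedged sketch of a Gemma-style gated FFN column (gate, up, down), not the real weights; the vectors are random stand-ins, and the up vector is part of the standard gated FFN even though I mostly talk about gate and down.

```python
# One FFN feature, roughly: a gate direction decides when it fires, the scalar
# activation scales the down vector, and that write is added to the residual stream.
import numpy as np

def gelu(x):
    # tanh approximation of GELU, the usual gating nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

d_model = 2560                         # residual stream width for Gemma 3 4B
rng = np.random.default_rng(0)
gate = rng.standard_normal(d_model)    # "when do I fire": a direction in the residual stream
up   = rng.standard_normal(d_model)    # scales how strongly the feature contributes
down = rng.standard_normal(d_model)    # "what do I write": the edge's payload

def feature_contribution(residual):
    activation = gelu(gate @ residual) * (up @ residual)   # a single scalar per feature
    return activation * down                               # scalar times the down vector

x = rng.standard_normal(d_model)       # stand-in for the layer's hidden state
x = x + feature_contribution(x)        # one edge of the graph being traversed
```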

So, we already saw that France's borders relation lives at layer 25, feature 5067. Let's look at that feature directly. Again, because it's a database, we can just do select star from features where layer is equal to 25 and feature is equal to 5067. It really is as simple as that. We'll run that, and there you go: layer 25, feature 5067. You see country, country, country, the relation is borders, and there is a score. France is basically one of the many entities that pass through this slot. The feature doesn't belong to France; France is just one node connected by this edge. We saw the countries cluster near France earlier on, and we saw the Australia relation on feature 9348. We can drill into it a little more: select star from features where feature is equal to 9348 and layer is equal to 26. If we look at that, you can see the Australia token, but it's also got Italy, Germany, and Spain. That feature is not an Australia feature; that's the key thing. It's a Western-nations feature. The model has compressed multiple countries into one slot because they appear in similar contexts, and that's basically polysemanticity.

Now, if we wanted to, we could trace that across all the layers. I'm just going to take the "and layer equals 26" away. If we look down here, there's the layer 26 Australia one, but if we look at the earlier layers, they're completely different, a different concept at each layer. The index is reused; the knowledge is independent. Each layer has its own gate and down matrices. If we look at feature 9348 at layer two, for example, you can see it's completely different. And we can do this for other features as well. If I pick feature 1484, you can see at layer two, planet; those Japanese characters are basically the Japanese for planet. So it has this concept of planet as a feature. Same concept, multiple languages. But if we look at layer six, for example, foods. Again, completely unrelated concepts with the same feature index in the same slot. But if we come down to L23, you can see Arizona, and the probe has picked that up as a capital feature, essentially. You can see Arizona, Phoenix, Phoenix; and Phoenix, of course, is the capital of Arizona. That's what polysemantic means in practice. The model reuses those 10,240 slots at every layer for different knowledge. And Larql shows you this: one query, all 34 layers, the full life of a feature index throughout the network.
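One way to reproduce that "full life of a feature index" view yourself, sketched in numpy: take the feature's down (write) vector at every layer and project it onto the unembedding to see which tokens it pushes. W_down, W_unembed, and vocab here are assumed placeholders for already-loaded weights, not the actual loading code.

```python
# Hedged sketch: trace one feature index across every layer by projecting its
# down (write) vector onto the vocabulary, logit-lens style.
import numpy as np

def trace_feature(W_down, W_unembed, vocab, feature_idx, k=5):
    per_layer = {}
    for layer, down in enumerate(W_down):        # W_down[layer]: (d_ff, d_model)
        write_dir = down[feature_idx]            # what this slot writes into the stream
        logits = W_unembed @ write_dir           # (vocab_size,) projection onto tokens
        top = np.argsort(logits)[::-1][:k]
        per_layer[layer] = [vocab[i] for i in top]
    return per_layer
```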

So, coming back to the graph database the model represents underneath: entities are nodes, features are edges, and relations are basically the labels on those edges. The probe that I ran discovered them automatically; they were relations that already existed, which the model created while it was being trained. And again, Larql allows you to query by relation across the entire graph. But what if I want to see the relations? Well, all I need to do is type show relations, and you can see a list of the relations that the model knows about, each with its probe-discovered name and how many features are associated with it. As you see, there are 1,489 probe-confirmed relation labels. The top 30 read like a knowledge graph schema, because that's what it is: manufacturer has 76 features, league has 60, genre 52, language 46. Nobody taught the model this schema. The model learned these categories because that is how the world is structured: things have makers, places have capitals, people have occupations. The FFN reinvented a relational schema from raw text.
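As a toy illustration, show relations is essentially a group-by over the probe's edge labels. The handful of rows below are made-up stand-ins, not the real 1,489-label schema.

```python
# Toy illustration of what `show relations` summarises: a count of features per
# probe-discovered relation label. The edge rows are illustrative stand-ins.
from collections import Counter

edges = [
    {"feature": 5067, "relation": "borders"},
    {"feature": 4924, "relation": "nationality"},
    {"feature": 8164, "relation": "nationality"},
    {"feature": 8799, "relation": "capital"},
    {"feature": 1484, "relation": "capital"},
]

for relation, n_features in Counter(e["relation"] for e in edges).most_common():
    print(relation, n_features)   # e.g. nationality 2, capital 2, borders 1
```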

Now, if we want to, we can even query by relation. If I do a select star from edges where relation is equal to capital and limit it to 10, you are going to see everything associated with the capital relationship. So you get the idea. Basically, 32 features across the network store capital-city knowledge, whether it's the Washington, Canberra, Brasilia one on feature 8799, or the Phoenix, Arizona one on 1484, which is the state-capital fact. And we can look for that as well. By the way, it doesn't matter whether I write edges or features: edges looks at the graph as a whole, while features is specific to the feature edges. So, if I say select star from features where layer is equal to 23 and token is equal to Arizona, it comes back with L23. I can mix and match and do whatever.

Let's come back to one of the earlier ones. If I do a select star from edges, nearest to Einstein, which we were looking at earlier, at layer 26, limited to 10, this will give you the Nobel features, or the award winners. If we look at that, feature 4874 is the one we talked about: academy, Nobel, Nobel Prize, etc. All very good. But you also see things like brain, brains, brain, and then particle, particle physics. So you can see that Einstein is associated with being very smart, with particle physics, for example, but also with Nobel Prize winning, or award winning.

And if you want to dive a little deeper into what entities the model knows about (we discovered France earlier), I can just do a select star from entities and limit it to 20. Then you can see some of the entities sitting there: national, Australia, Chinese, Microsoft, Google, TikTok, etc. It gives you an idea of all the different entities in there, and the number of features associated with each entity. You see what I'm saying? A complete graph database. And again, if I want to look at a specific layer, I can just pick a layer and see what's associated with it. In this case, I've limited it to layer 26, and I can see which entities are associated with layer 26.

So, everything so far has been about browsing the graph. And the graph is messy. Feature 9348 fires for Australia, Italy, Germany, and Spain. The capital relation returns Washington, Canberra, and Brasilia in one cluster. Describe France shows CEO and fountain alongside real facts. Why is that the case? Because each feature is one-dimensional: one gate vector, one score, a single scalar activation. When the gate fires, it can't distinguish why it fired. Was the input about France the country, France the language origin, or France the neighbor of Germany? The feature compresses all of this context into one number. And that's the polysemanticity problem. It's not a bug; it's a dimensionality constraint. 10,240 features per layer is a lot, but the residual stream is 2,560-dimensional. The model has to project a high-dimensional space down to scalar activations. Every projection loses information. Every feature is a shadow of the full representation.
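Here's a small numpy illustration of that constraint: two completely different residual states can project onto a gate direction with exactly the same scalar, so the feature's activation alone can't tell them apart. Everything in the snippet is synthetic, just to show the geometry.

```python
# Projecting a 2,560-dimensional state onto one gate direction yields a single
# scalar, so very different inputs can activate the same feature identically.
import numpy as np

d_model = 2560
rng = np.random.default_rng(1)
gate = rng.standard_normal(d_model)
gate /= np.linalg.norm(gate)

x_country = rng.standard_normal(d_model)       # stand-in: "France the country"
noise = rng.standard_normal(d_model)
noise -= (noise @ gate) * gate                  # remove the gate component
x_language = (x_country @ gate) * gate + noise  # very different state, same projection

print(x_country @ gate, x_language @ gate)                 # identical scalars
print(np.linalg.norm(x_country - x_language))              # yet far apart in full space
```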

So, how does the model actually answer the capital of France correctly? The answer is attention. Attention is all you need. Attention operates in the full-dimensional residual stream, not in the one-dimensional feature space. The query "the capital of France" creates a specific pattern across all 2,560 dimensions, and the attention heads at each layer match that pattern against the key vectors, selecting which features to write the signal through and which to suppress. That's how the polysemantic noise gets handled: CEO, fountain, Australia get low attention weight because the query pattern doesn't align with those directions. One feature can't tell you what the capital of France is; 34 layers of attention-weighted features can. The features are the edges, attention is the routing. You need both.
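For contrast with the one-dimensional feature activations, here's a bare-bones single-head attention sketch showing that the routing decision happens over full-width vectors. Shapes and weights are random stand-ins, and the output projection is omitted for brevity.

```python
# Single-head attention, minimally: scores computed from full-width states,
# producing a full-width update rather than a per-feature scalar.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_head, seq = 2560, 64, 5              # e.g. "the capital of France is"
rng = np.random.default_rng(2)
X  = rng.standard_normal((seq, d_model))        # residual stream for each token
Wq = rng.standard_normal((d_model, d_head))
Wk = rng.standard_normal((d_model, d_head))
Wv = rng.standard_normal((d_model, d_model))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
weights = softmax((Q @ K.T) / np.sqrt(d_head))  # which positions to route from
routed  = weights @ V                           # full-width write back into the stream
```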

Let's do an inference, because I haven't done that yet. Let's do infer the capital of France is, and we'll say top five. I want you to see what it's done here. It came back with the predictions, and as you would imagine, the token Paris came back at around 80%. So, this is actual model inference, next-token prediction, attention and the graph combining to predict the next token, which in this case is Paris. Now, what's interesting about Larql here is that it lets you do an inference trace across the layers, showing how it got to the answer. And this is the key thing: you can see every single feature that activates in these layers, the difference that attention makes, and how it gets to the final answer.

And here's the thing that might not have been obvious about what I did there. When I run inference, and this is the exact same next-token prediction you'd do on any LLM, I want you to notice the "walk FFN" part. I'm not doing a matrix multiplication; I'm doing a graph walk, because I'm using a format called the V index format, where I decompose all the matrices into a graph structure. The knowledge half of each layer is basically doing a graph walk. Infer is a graph walk. At each layer, the inference engine takes the current residual stream, does a KNN lookup against the gate vectors to find which features are nearest neighbors to the current state, and then the matched features fire. It's the same thing that happens in dense models: the down vectors accumulate into the residual stream as before, and then the walk moves to the next layer. Attention is still doing matrix multiplications; the QKV projections and output projections remain dense matrices. Attention is the routing mechanism: it selects which path the walk takes through the graph. The FFN is the graph, attention is the navigator, and together they produce the forward pass.
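Here's a hedged sketch of that walk at a single layer, as I've described it: a nearest-neighbour lookup against the gate directions, then only the matched features fire and write their down vectors. The array names and sizes are illustrative, not the actual V index format.

```python
# One layer of a "walk FFN": KNN lookup against gate vectors, fire only the
# nearest features, accumulate their down vectors, and move on.
import numpy as np

def walk_ffn(residual, gates, ups, downs, k=32):
    # Cosine similarity between the current state and every gate direction.
    sims = gates @ residual / (np.linalg.norm(gates, axis=1) * np.linalg.norm(residual) + 1e-9)
    nearest = np.argsort(sims)[::-1][:k]           # the k edges the walk traverses
    out = np.zeros_like(residual)
    for f in nearest:
        a = max(gates[f] @ residual, 0.0) * (ups[f] @ residual)  # ReLU stand-in for the gate
        out += a * downs[f]                        # each fired feature writes its down vector
    return residual + out                          # the walk continues with the updated state

d_model, d_ff = 256, 1024                          # small demo sizes (Gemma 3 4B: 2560, 10240)
rng = np.random.default_rng(3)
gates = rng.standard_normal((d_ff, d_model))
ups   = rng.standard_normal((d_ff, d_model))
downs = rng.standard_normal((d_ff, d_model))
state = walk_ffn(rng.standard_normal(d_model), gates, ups, downs)
```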

And here's the thing: although in Larql I've turned this FFN into a graph, the FFN was always a graph. The matrix was just an inefficient way of encoding it, and all I've done with the V index in Larql is remove that encoding so I can query the structure directly. It's the same weights as in the model; I've just reorganized them, and I'm doing a KNN-lookup walk. This is the point: when I say the model is a graph, I'm literally saying the model is a graph. The graph is messy, but the walk is precise, and attention is the thing that gets you through it.

Now, here is the cool thing. So far, we've been reading the graph. But because it is a graph, and because it is a database, we can also write to it. Let's take something like infer the capital of Atlantis is, and ask for the top five answers. Now, of course, in this case it doesn't know what the capital of Atlantis is, and we're going to see the model do some guessing, some good old hallucination: it just says believed, said, etc. You can literally see that the model has no idea. There is nothing in the graph, nothing in the model, for Atlantis.

So, what we can do is run an insert. We can give it a fact. Let's clear this for a second, and we're going to do an insert into edges. We will pass an entity, a relation, and a target. This is cool, right? We'll say values, and we'll pass in Atlantis as the entity, we'll say it has the capital relationship, and we'll say Poseidon is the capital of Atlantis. From that one statement, the insert pipeline captures the model's residual at layer 26 for the canonical prompt "the capital of Atlantis is", engineers a gate vector from that direction, synthesizes a down vector pointing towards the Poseidon token, and installs the gate-up-down triple into a free feature slot.
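A hedged sketch of that insert step in numpy: the gate is engineered from the canonical prompt's residual direction, the down vector points at the target token's direction, and the triple goes into an unused slot. prompt_residual and target_dir are assumed inputs here (the real pipeline captures them from the live model), and the scaling choices are illustrative.

```python
# Synthesize a gate/up/down triple for a new fact and install it in a free slot.
import numpy as np

def insert_fact(gates, ups, downs, free_slot, prompt_residual, target_dir,
                gate_scale=1.0, write_scale=1.0):
    gate = prompt_residual / np.linalg.norm(prompt_residual)  # fires on this prompt's direction
    down = target_dir / np.linalg.norm(target_dir)            # writes towards the target token
    gates[free_slot] = gate_scale * gate
    ups[free_slot]   = gate_scale * gate   # simplest choice: reuse the gate direction for up
    downs[free_slot] = write_scale * down
    return gates, ups, downs
```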

There's also a balancer on this. The balancer is really important for dealing with the dimensionality: it scales the vectors so that the fact lands at exactly the right strength, strong enough to be top one on the canonical prompt, but not so strong that it ends up hijacking all the other capital queries. Remember, those features all have lots of different things packed into the same dimensions; a feature does not serve just one entity, there are lots and lots of different entities related through it. So the new fact needs to be strong enough to win its own prompt, but not strong enough to hijack the others.
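One way to think about that balancing, as a hedged sketch: search for a write strength where the new fact wins on its canonical prompt without disturbing a control prompt. top_token(prompt, scale) is an assumed helper that applies the patch at that strength and returns the model's top prediction; it is not Larql's actual API.

```python
# Bisection over the write strength: strong enough for "Poseidon" to win,
# weak enough that "the capital of France is" still answers "Paris".
def balance(top_token, lo=0.0, hi=10.0, iters=30):
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        fact_ok    = top_token("the capital of Atlantis is", mid) == "Poseidon"
        control_ok = top_token("the capital of France is", mid) == "Paris"
        if fact_ok and control_ok:
            return mid          # lands at the right strength, no leakage
        if not fact_ok:
            lo = mid            # too weak: push the write strength up
        else:
            hi = mid            # hijacking other capitals: back it off
    return None                 # no workable strength found in this range
```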

Now, if I infer the capital of Atlantis is and run it one more time, look here: in the predictions, Poseidon is now there at 99.98%. It's at the top; the fact is installed. And if I want to, I can infer the capital of France one more time, just to check that we haven't broken Paris. There you go: Paris at 81% prediction. We haven't broken anything; there's no leakage, nothing is bleeding, it's all good. And again, I can do a describe Atlantis in my graph, and there we can see all my edges. I've got my new edge, Poseidon; it's visible in there, and you can see it's related to the capital feature.

Now, you're probably thinking to yourself, that's cool. And by the way, this lives in what I call a patch overlay. Whenever I'm connected to the session, I leave the real V index, the weights, as read-only, and what I'm doing is a runtime edit over the top of the base V index. If I want to make it permanent, I can do a compilation and save it for real. All I need to do is a compile current into V index, point it at a temporary Atlantis .vindex file, and run it. That takes a second, and then it bakes the fact into the weights permanently.

The way I do that is with the MEMIT technique; there's a paper on that which lets you bake essentially any fact into the weights. So this compiles the fact into a standalone V index: the inserted gate, up, and down vectors are all written into the canonical weight files. No overlay, no sidecar, no special LoRA; it's just a normal V index file. And then, if I want to, I can take that V index file, and there is an export function in the CLI, so I can export it back out to safetensors, or out to GGUF, and then it will work with every other model provider.
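For what the export step amounts to once the patched weights are plain arrays, here's a hedged sketch using the safetensors library. The tensor names and tiny shapes are illustrative placeholders, not the actual Gemma or V index layout.

```python
# Writing patched weight arrays to a .safetensors file is a single call.
import numpy as np
from safetensors.numpy import save_file

tensors = {
    # Illustrative names and tiny shapes; the real Gemma FFN matrices are far larger.
    "model.layers.26.mlp.gate_proj.weight": np.zeros((8, 4), dtype=np.float32),
    "model.layers.26.mlp.up_proj.weight":   np.zeros((8, 4), dtype=np.float32),
    "model.layers.26.mlp.down_proj.weight": np.zeros((4, 8), dtype=np.float32),
}
save_file(tensors, "atlantis_patch.safetensors")
```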

And if I want to use that new index, I can fire up Larql one more time, and this time connect to my temporary V index. You can see it's now connected. This is a fresh session, connected directly to the newly compiled V index, and if I run infer the capital of Atlantis is, you're going to see that the capital of Atlantis is, of course, Poseidon. There it is at the top. The fact I compiled is in the bytes now. And again, I can compile it down to safetensors or GGUF. So, there you go. I think I've definitively proved that the model, or at least the FFN part of the model, is a database. It's a graph database. You literally watched me query the weights directly. We've done select statements. We've inserted new knowledge. We've compiled it back down into weights. It's a database: you can select, you can describe, you can see the entities, you can see the relationships. It's pretty cool. And that's all without training; that is just standard database stuff, and that's why I created the Larql language. You saw me work it against the Gemma model, but this will pretty much work against any model. There are a few things I need to tweak to make it work with various other models and test it against them. It works great against the Gemma 3 4B model, and I'm going to do the Gemma 4 model pretty soon, because I think I've got something even cooler to show you.

Because the impact here is huge. Remember, in Larql you saw me actually doing inference with a KNN walk: walking the graph database as opposed to doing matrix multiplications. The implications of that are absolutely huge. I was also editing a model without training; the implications of that are huge too. And when I get to my next video, I think you're going to see why, because what I've essentially done is decouple attention from the largest part of the model, which is the knowledge store. And if I've decoupled attention from the knowledge store, then the knowledge store doesn't need to live on the same machine as attention. It can live on a different server; it can be a remote web server. I'm going to show you that in one of the next videos. It also means we can get super efficient about how we load the model. One of the other things we're going to be able to do is run some of the largest models, and I mean the Gemma 4 31B model, locally on a laptop. In fact, I think we could probably take the largest models, even the Kimi K2 models, and run those on a laptop as well. But that's for a future video.

And finally, I think you can probably see that if I can insert and patch knowledge into the database, well, we could probably do, what's the best way of saying this, building models from scratch, training-free? You'll have to wait for another video to see that. Anyway, I hope this video's been useful, and I hope you get the idea that the model is just a graph database.

Cheers.
