
“Protocol learning and unextractable protocol models”

By cyber•Fund

Summary

## Key takeaways

- **Base Models as Cultural Infrastructure**: Base models are not just fundamental software infrastructure but base cultural infrastructure that injects values and opinions into everything it touches, and having a few entities control them is very dangerous. [00:30], [00:43]
- **Disconnect in AI Control Perceptions**: The entities producing these models act as if control is existential, creating a massive disconnect from the public narrative that everything is fine, similar to Uber in 2015 before monetization shifted alignment. [01:04], [01:31]
- **Unsustainable Open Source AI**: Current open source AI relies on someone spending huge money to give results away for free, or on a dependency on Chinese companies releasing open weights, which is not sustainable the way open source software is because producing these models is very expensive. [02:13], [02:44]
- **Low-Bandwidth Model Parallelism Breakthrough**: Protocol learning enables splitting transformer models across consumer devices by exploiting low-rank properties in output projections and activations, allowing communications to be compressed so small devices can train models bigger than they could alone. [06:08], [08:23]
- **Unextractability Prevents Value Leakage**: By injecting identity transforms split into orthogonal components between blocks, the model function remains unaltered but the weights are scrambled, making extracted partial weights incompatible and reconstruction costing up to 100% of training from scratch. [11:49], [13:33]
- **Live Run with 300 Global Participants**: A 7.5 billion parameter model was trained on 1,642 GPUs from 300 people across 198 cities with 36 chipsets and an average of 465 Mbps bandwidth, showing no loss spikes and enabling a power-law revenue distribution back to contributors based on flops. [14:46], [16:07]

Topics Covered

  • AI Models Inject Values
  • Open Source AI Unsustainable
  • Compress Activations Low Rank
  • Unextractability Prevents Extraction
  • Heterogeneous Training Proven Scalable

Full Transcript

Okay. Hello everyone. My name is Alexander, founder of Pluralis. We work on protocol learning. Protocol learning is a way of creating base models, as protocols, in large heterogeneous swarms of compute, and trying to add sustainable economics to this whole process.

I think this is important because the base models are going to be a substrate for everything. These are not just fundamental software infrastructure; these are base cultural infrastructure. I think this is by now the consensus opinion: that these are going to have a very, very large impact, and that there are going to be very few entities that control these things.

And I think this is very dangerous. These things are not just a critical dependency; they inject values and opinions into everything they touch. And we should be more worried about this than we are. The entities that are capable of producing these models today are acting as if this is existential. So I think there's a massive disconnect between the groups that are actually producing these models and the general public narrative and opinion. These are the actions of groups that have decided they cannot lose control of this layer, whereas public opinion is that everything's fine and we've got no issues.

And people might say to me, well, Alex, I use ChatGPT all the time, everything is fine, I'm not really worried about OpenAI having total, complete control of the layer. I think there's some truth to that today, but I would make the point that right now we are like Uber in 2015: everything is great, the companies are maximizing for user growth, they are maximally aligned with the users. It will not stay this way. Eventually these models start to get monetized, and the companies will not be aligned with you as a user. When that happens, you basically have some of the most powerful advertising machines ever. You will buy things and not even know if you wanted to buy them or if the thing tricked you into buying them. And I think this is not good. And by the time that happens, there are really few alternatives. These will be very entrenched systems.

And I just want to make the point that I don't think we have a reasonable alternative in actual open-source AI today. Right now, open source AI refers to someone, somewhere, spending a huge amount of money and giving the result away for free. I don't think this is sustainable. Right now, we have something of a dependency on Chinese companies releasing open weights. I know there are new attempts to bring open-core-style development to this process, but the fundamental thing here is just that the dynamics are different. This is not like open source software. It's very expensive to produce these things, and so open source won't save us.

What I think you need to have is a way to actually create these things, like I said, in heterogeneous swarms of consumer-grade hardware, and a way to actually make that sustainable. Not "okay, we raise a bunch of VC money and we train a few models as one-off things," but that the economics of the system as a whole are actually consistent. And I don't think there are many people who would disagree with me that that's a good thing. The problem is: can you even actually do something like that? Until a year ago, there was really no path to getting something like that to work.

I call this protocol learning, but it refers to a very specific thing in the AI literature, which is heterogeneous, multi-participant, low-bandwidth model parallelism with this unextractability property. I'm going to go through and explain what each of those means. But the fundamental problem here is low-bandwidth model parallelism. Could you train in such a way that the model itself is split up over devices? This is what you have to do if you want non-data-center compute to be useful.

And yeah, a year ago I would say no one really thought that could work. At Pluralis we have some very exciting results I would like to share. We just had three papers accepted at NeurIPS. These are main-track papers. These are not collaborations; these are sole Pluralis-authored papers, and we were fewer than 10 people when we wrote them. So this is real stuff; these are not technical reports. And they address the core technical problems, the research problems that you would need to solve for any of this to work. We had one on pipeline parallelism that's public. The context-parallel one is not public; I'm going to talk about it briefly today. And then we have a paper on the unextractability, which, like I said, in my view allows you to inject actual sustainability into this whole process and is quite novel. And then we also had a live run which operationalized this stuff.

I want to make a very core distinction at the start, which is that there's decentralized training and distributed training and all these kinds of terms, and now I'm introducing a new term, protocol learning. There's a fundamental difference between training in a data-parallel way and a model-parallel way, and I think this nuance has been lost.

When you do data-parallel training, you replicate the model over nodes. So this is things like DiLoCo and DeMo and PowerSGD and federated averaging, all these existing approaches where you can train a small model on small nodes, or you can train a big model on big nodes, but you cannot train a big model on small nodes. You can get 8x H100s, you can replicate the model, you can train probably a 32-billion-parameter model, maybe bigger, maybe up to a 70, but beyond that you suddenly need multiple racks of H100s to fit the model, and they all need the fast interconnects. If you want to do model parallelism, you're in a different regime where you're splitting the computational graph.
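To make the data-parallel versus model-parallel distinction concrete, here is a minimal sketch. It is my own illustration with made-up sizes, not Pluralis code: data parallelism puts a full replica on every node, while model (pipeline) parallelism gives each node only one block and passes activations between nodes.

```python
# Illustrative only: data parallelism replicates the whole model per node,
# so each node must hold all blocks; pipeline/model parallelism assigns one
# block per node, so the full model can exceed any single node's memory.
import copy
import torch
import torch.nn as nn

d_model, n_blocks = 1024, 8
blocks = [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                        nn.Linear(4 * d_model, d_model)) for _ in range(n_blocks)]

# data parallel: every node carries a full copy of all n_blocks
dp_nodes = {f"node{i}": copy.deepcopy(nn.Sequential(*blocks)) for i in range(4)}

# model parallel: node i carries only block i
mp_nodes = {f"node{i}": blk for i, blk in enumerate(blocks)}

x = torch.randn(2, d_model)
for name, blk in mp_nodes.items():
    x = blk(x)  # in a real run this hop crosses the network between participants,
                # which is why compressing what gets sent matters so much
print(x.shape)  # torch.Size([2, 1024])
```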

So suddenly, communication within the model has to become extremely compressed. And this is where there was some prior work, if people are familiar with SWARM parallelism by Max Ryabinin at Together; I'd say that was the one precursor work to this kind of thing. But it didn't work all the way, because it didn't have the compression.

And this is the thing we're most excited about. This is our main research result that we think is really cool. We're calling this protocol models, because it allows you to train across consumer devices. And it solves this problem: can you actually split the model up over GPUs?

The problem here is that if you do anything to those activations and activation gradients, you change the training dynamics. You can't do this in any kind of normal, unprincipled way. I won't go into massive technical detail, but I want to give you the intuition.

What we're doing here is exploiting a property that is actually a known thing in transformer networks. These are not new architectures; this is not some crazy esoteric thing. It was known that in transformers the output projection weights, the W_P1, W_P2 matrices, are low rank; they become low rank during training. However, the activations don't. And what Sameera and our team figured out was that the reason the output activations actually stay high rank is the token embeddings and positional embeddings that cascade through the network via the residuals.

What he figured out was: okay, the positional embeddings are fixed, so you can take them out, cache them, and put them on every block; but token embeddings are data dependent, so you can't do that. And he came up with a very good solution here, where you basically split your token embeddings into a fixed high-rank component, which you cache at each block and inject back in in a principled way, and trainable low-rank components. If you do this (and I'm skipping a lot of detail here), fundamentally what happens is that the signal coming out of the activations is very low rank, so you can compress it, and the backward gradients are also low rank, so you can compress them. We also have another trick in here where we actually let the subspace vary, which means the output projection weights are moving in the full ambient space while the actual activations and activation gradients remain low rank. This is what lets you compress.

And it means we can put one transformer block on one device and another transformer block on another device. In our live run there were 16 GB cards holding single blocks of bigger models. And it gives you this really key property: small devices can train models bigger than they could in any other setting by themselves.
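As a rough illustration of why low-rank activations buy you so much, here is a minimal sketch. It is not the Pluralis algorithm (which handles the embedding split and lets the subspace vary as described above), just a toy showing that a rank-r activation tensor can be sent between pipeline stages as two small factors instead of a dense matrix; the shapes and the rank r are made up.

```python
# Minimal sketch: if the activations leaving a block live close to a rank-r
# subspace, the stage can send two small factors instead of the full
# (tokens, d_model) tensor.
import torch

def compress_lowrank(acts: torch.Tensor, r: int):
    """Factor activations X ~= U_r @ V_r so only (n x r) + (r x d) floats cross the wire."""
    U, S, Vh = torch.linalg.svd(acts, full_matrices=False)
    U_r = U[:, :r] * S[:r]          # (tokens, r)
    V_r = Vh[:r, :]                 # (r, d_model)
    return U_r, V_r

def decompress_lowrank(U_r: torch.Tensor, V_r: torch.Tensor):
    return U_r @ V_r                # reconstructed activations for the next stage

# toy usage: d_model = 512, effective rank ~ 16
tokens, d_model, r = 1024, 512, 16
X = torch.randn(tokens, r) @ torch.randn(r, d_model)   # synthetic low-rank activations
U_r, V_r = compress_lowrank(X, r)
X_hat = decompress_lowrank(U_r, V_r)
ratio = (U_r.numel() + V_r.numel()) / X.numel()
print(f"sent {ratio:.1%} of the dense size, relative error "
      f"{(X - X_hat).norm() / X.norm():.2e}")
```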

We have other work here, which is: okay, if pipeline parallelism works, what is the very next thing you run into? We show pipeline parallelism works for roughly 4k to 8k context. Most people, I'm guessing, know that when you do mid-training it's very important that you can increase the context length of the models, and this increases the memory requirements on the node significantly. So, very fortunately, we found that very similar techniques can be used on the attention matrices, because the QK matrices are also very low stable rank. This is a bit more involved and there's a bit more going on here.
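For reference, "stable rank" here is the standard quantity ||A||_F^2 / ||A||_2^2; a small value means a few directions carry almost all of the matrix's energy. A minimal check, with a made-up toy matrix rather than real QK weights:

```python
# Minimal sketch: stable rank = squared Frobenius norm / squared spectral norm.
import torch

def stable_rank(A: torch.Tensor) -> float:
    fro_sq = A.pow(2).sum()
    spec_sq = torch.linalg.matrix_norm(A, ord=2) ** 2
    return (fro_sq / spec_sq).item()

# toy example: a 1024x1024 matrix built from 8 rank-1 terms plus small noise
d, k = 1024, 8
A = torch.randn(d, k) @ torch.randn(k, d) + 0.01 * torch.randn(d, d)
print(stable_rank(A))   # prints a small number (well below d), reflecting the low-rank structure
```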

I would say that of these two works, the pipeline-parallel work for us feels very complete; it's simple and correct and it's working. I think there's a little bit more we're going to need to do on the context side. But you can see that even with the current system, you're bringing this from "too slow to ever work" to "okay, it's now in the realm where you can train this with a small convergence penalty."

So just on that graph: purple is without this, and blue is with it.

Okay. So at this point: great, Alex, maybe we can train with small devices, train big models, and people can volunteer their compute. I think that gets you... not to the point where you can train really, really big models, because I don't think volunteer compute ever gets far enough.

What you have to have is a way to implement programmatic incentives, where the value flow from the downstream inference revenue of that model can flow back to the training contributors. This to me is the really interesting thing, and it's what I've been interested in since we started this whole thing. The problem is that if you are training in the DDP way, where you're giving everyone a full weight set, or even if you're training in the normal way, where you're giving people blocks, then if any one person can go in, take out each weight set, and assemble a full weight set of that model outside of the protocol, you can capture some value, but the value leaks. All of that effort you just put into all these flops and compute to train the model, and suddenly someone can take the result and stand it up somewhere else. It's very hard to return the value that you created back to the contributors.

So what we would like is this very key property of unextractability. Can people collaboratively train without the weight set ever coming out? It might seem like: how is that even possible? We show that you can do it. I'm going to build intuition here in the pipeline-parallel case, but we actually apply this to the general case as well. In pipeline parallelism, the model is split up: like I said, one participant is seeing one block or stage of a much larger model. So you already get a benefit: I'm participant one, I see layer one.

But if I'm very smart, I can go: well, I'll join the protocol, I'll leave, I'll join again, I'll leave, I'll join again, I'll leave, and maybe eventually I accumulate enough blocks that I can extract the full model weights. And this happens, and it is especially dangerous at the last part of training. When the weights are stable and not moving much, you might get them over a few time steps, and you're going to be able to take the model weights out.

And we do something I think is very clever, which is to inject an identity transform between the blocks. So you don't alter the actual overall model function; you're just injecting an identity. Then you split that identity into two orthogonal transforms, T and T-inverse, and you fold those into the blocks of the model. I would encourage people to read the paper, because there's a bit of detail, but fundamentally, transformer blocks and the way they're structured with the residuals mean that when you fold those transforms back into the blocks, you actually obscure almost all the parameters. And this is really cool: I didn't alter the overall network function f, the model performance is unaltered, but I've now scrambled all the blocks inside the model, and I only required communication between two blocks; I didn't require any global state or any knowledge of the whole weight set.

You can kind of visualize it in this cool way: imagine each cylinder is a block; you're periodically applying these transforms at every interconnected stage, and the model is morphing over time. So despite the model performing exactly the same from the perspective of the end user, the actual weights are constantly morphing and changing. And it means that if someone does that kind of attack, where they go in, they get a block, they go in another time, they get another block, they take them out, and they try to assemble them, those weight sets are fundamentally incompatible.
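Here is a minimal sketch of why the injected identity leaves the end-to-end function untouched, using a toy two-stage linear pipeline rather than real transformer blocks (the residual structure, which is what hides almost all the parameters, is not modelled); everything below is illustrative, not the paper's construction.

```python
# Toy illustration: fold an orthogonal T into stage 1 and T^{-1} (= T.T) into
# stage 2. The composed function is unchanged, but each participant's local
# weights no longer match the originals.
import torch

torch.manual_seed(0)
d = 64
W1 = torch.randn(d, d) / d ** 0.5   # stage 1 weights (participant 1)
W2 = torch.randn(d, d) / d ** 0.5   # stage 2 weights (participant 2)
x = torch.randn(8, d)

Q, _ = torch.linalg.qr(torch.randn(d, d))   # random orthogonal matrix
T, T_inv = Q, Q.T                           # T @ T_inv == identity

W1_scrambled = W1 @ T        # only participants 1 and 2 need to coordinate;
W2_scrambled = T_inv @ W2    # no global state, no full weight set involved

out_ref = x @ W1 @ W2
out_new = x @ W1_scrambled @ W2_scrambled
print(torch.allclose(out_ref, out_new, atol=1e-5))   # True: same function
print(torch.allclose(W1, W1_scrambled))              # False: local weights differ
```

The way I read the talk, if each pair of neighbouring stages periodically re-draws its own T, blocks captured at different times end up in different "coordinate systems," which is why pieces stolen across time no longer compose.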

And we show in this paper that even if you did this, and you tried to fine-tune it with the exact data set that was used to train the model, your compute cost is 60% of just training the whole thing from scratch. Like I said, this paper is accepted and will appear at NeurIPS, which means we did most of this about six months ago. There's follow-up work now which has this at 100%: the cost to reconstruct is as much as training from scratch.

And we also show empirical results that you can apply these transforms for a long period of time: numerical imprecision doesn't accumulate, and the model stays the same. Okay. So that gives you a way of injecting, in my view, some level of sustainability and monetization into an open training process.

It shows that you can do this in this low-bandwidth way, at least in the research. And now there becomes a question of: well, can you deal with massively different devices in terms of capability? Can you train with a 3090 alongside an H100? And can you even do multi-participant training at all? That's not really something that people have done to date.

So we just wrapped up one of these runs. I'm going to look up the exact stats here. We trained a 7-billion-parameter model, 7.5 billion, and we just posted about it on Twitter.

We were like: look, this run is open, anyone can come and join it. There's no whitelist or anything stopping anyone. And 300 people ended up joining. We had 1,642 GPUs in total across 198 cities; most of these devices were not in data centers. There were 36 different chipsets, ranging from H100s to 3090s to 4090s, all the way down to a T4. And people joined with horrendously bad bandwidth: the average was 465 megabits per second, and 128 nodes contributed with less than 100 megabits per second, so that's Starlink-speed download. And we were able to actually process batches and contribute to this thing. We didn't have any loss spikes. The loss curve is there.

The reason we did a 7.5 billion is that, if anyone knows large-scale training dynamics, loss spikes show up after about 5 billion parameters. And yeah, that loss went down basically completely unaffected.

And we logged people's floating-point contributions and gave people a score. On the top left there, you can see that if we were to monetize this model, there would be basically a power law of revenue ownership that would flow back to those training contributors. And this is basically productionizing the first of those papers, the pipeline-parallel version. There's no context parallelism or unextractability in that proof of concept.

Okay. So I just want to get people a bit excited. I think there's real stuff here; this is now accepted at tier-one venues. Until now, AI people have not really crossed over into this area; there's been a bit of an anti-reaction. I think that cannot continue while these papers are starting to show up, and I think what this results in is just fundamentally really exciting.

I started the talk by talking about things we want to avoid; it's much better, at least for me, to think about what this enables. Suddenly you can do model lineage flows: when you design these models, you don't just do one big run. You do a pre-train, then you might do a mid-train, you might fork off somewhere and do something else, you might merge the models. And if you can maintain unextractability through that whole process, suddenly value from model derivatives can flow back to the base model, and all of this stuff gets really, really exciting, I think. And you're allowing compute capital to come in and speculate at the model level; there are a lot of components of crypto that I think become very directly applicable. The hope is that you can open up innovation at this layer as well: that you can actually have this stuff developed in the open while maintaining the value that's created, and it stops being constrained to these top labs, to all the researchers who are at Meta because they're the only ones who have the GPUs, because that's the only way this level of capital formation can be applied to this layer. This can actually be opened up, which to me is a really positive, net-good thing. And the final result of all this is that I think we genuinely have a path here to direct fractional ownership at the model layer itself, which is the base dependency for everything. So, thank you everyone, and happy to answer questions on that. [applause]

>> Go ahead, Danny.

>> Thank you.

>> When you were tracking the floating-point contributions, what variables went into the scores that people were being assigned?

>> It's literally just: we figure out how much it costs to produce that activation or activation gradient for that layer, and then we accumulate flops at that sort of chunk level, if that makes sense. So for every valid activation you produce, we know how many flops that would take, and we assign it back.

>> And are there advantages that more sophisticated actors have, in terms of better hardware optimizations they're making, that lead to some form of, not necessarily centralization, but the accrual of scores to better actors?

>> Yeah, I'll show you something really cool, actually. Because we incentivize at the flop level, certain devices will just be able to produce those flops cheaper or better, and if you have a lower cost of energy, it costs you less. We didn't do this on purpose, but it turned out that for this run the most efficient type of card was a 3090 or 4090, in terms of efficiency per Pluralis point you got. And very quickly everyone just started using 3090s and 4090s, and we were like, why is everyone doing that? We checked, and it's like, oh, this gives you the best points for this particular setup. So the interesting thing here that I didn't mention is that this is the first example of some kind of market dynamics starting to show up, which is why I want to do a market on floating-point operations in these runs.

>> Thank you. Incredibly cool.

>> Thanks man.

Yeah.

>> Yeah, I think that's a great point: there is much more than just compute that's important here. The way I think about data specifically is that it wasn't clear how you could monetize the data while the models themselves were not monetized. Now the models are monetized, or there seems to be a path to doing that, and it's very clear how, in that revenue graph I showed, there could be a slot for the dataset contributors. I think there's a million ways you could set this up: you could say the model designers bid on the datasets they want to allow in, these kinds of things. But because the marginal cost of providing the data is basically zero, you would expect people to make it available. I personally have no interest in incentivizing data contributions; my hope is that there are a million companies that do this, and I'd rather partner with them for that and make it available.

And then the other part is how you incentivize model designers in this whole thing; until now I've just talked about compute showing up. I think there are a lot of ways you can do it. A really interesting way is to give the model designer an early, disproportionate mining period on the model: okay, you can contribute 1% of the total flops, but instead of that giving you 1% of the ownership, it gives you 10%. That gives you two things. One, the model designer is buying into their own model, so they're not speculating and proposing gibberish. And two, it solves the question of how you incentivize a model designer to care at all: because they're the ones that proposed the model, they get this nice early mining period.
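A hypothetical sketch of that "early disproportionate mining period" idea; the boost multiplier, budget, and structure are all made up to illustrate the mechanism, not anything Pluralis has specified.

```python
# Illustrative only: the designer's flops, up to a small budget, earn ownership
# at a boosted multiplier; everyone else's flops count at face value.
def ownership_shares(contributions: dict[str, float],
                     designer: str,
                     designer_boost: float = 10.0,
                     boost_budget_frac: float = 0.01) -> dict[str, float]:
    """contributions: raw flops per participant."""
    total = sum(contributions.values())
    weighted = {}
    for p, flops in contributions.items():
        if p == designer:
            boosted = min(flops, boost_budget_frac * total)
            weighted[p] = boosted * designer_boost + (flops - boosted)
        else:
            weighted[p] = flops
    wsum = sum(weighted.values())
    return {p: w / wsum for p, w in weighted.items()}

# the designer contributes 1% of the flops but ends up with ~9% of ownership
print(ownership_shares({"designer": 1.0, "others": 99.0}, designer="designer"))
```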

But again, I think there's many ways you could approach that. Yeah.

Go ahead.

>> Yes.

>> Yeah, good question. We did this with a very clever way of verification that I think will hold up for a long time, but it does not have theoretical guarantees. So if someone really, really wanted to, they could probably break this particular run. We actually have a whole follow-up work on just this question that I didn't talk about today. But you have to have decentralized work verification happening, because you actually have two problems. You have someone who might want to take value out, so they're producing gibberish just to get points; but then you might also have people who say, I don't even care about getting points, I want to break the model. You can set up the economic-irrationality thing, but that solves problem one, not problem two. For problem two, you've got to have real, proper, serious decentralized work verification happening. Fortunately, I think the model-parallel case actually gives you straightforward ways to implement this. TBD, the work's coming soon; it'll be public soon. It's in at ICLR right now. Yeah.

>> Thank you.

>> No problem. Thank you.

I have one question about scaling this. Maybe it's the hard question, but I was on this panel with Nathan Lambert earlier this week, and you know him, right? He's kind of a bear on decentralized training. And he was talking specifically about the hardness of the problem and the loss spikes that they're even struggling with in the centralized setting. So I think this is a fantastic proof of concept that this can be done with the properties you showed in the paper, but how do we scale it?

>> Yeah, I guess the fundamental point is that these methods are just inherently scalable. In a DDP setting, you would run into the restriction of what you can fit on device; with this model there would literally be nothing stopping you. We could just add another block. We could add another 10 blocks to this particular model. We could make the blocks bigger and use, instead of a 16 GB GPU, a 24 GB one, and that would be fine. Actually, even in the pipeline-parallel case, it's very straightforward to split a block in half, because the intermediate output projection has the exact same properties as the output one. So that's the other point I should emphasize: this is directly scalable. We started this at 7.5 billion; we haven't done this above a 32 billion. I think there are zero immediate obstacles to going straight to 100 billion here; after 100, maybe. But I also think there are things you can do there too.

>> So the research is there to scale; then maybe it becomes engineering problems.

>> I think it's engineering, yeah. I think the main immediate challenges here are robustness: dealing with many, many people coming and going, on very different hardware types. These are just very hard distributed-systems problems, and it's engineering that has to happen.

>> Okay. I think, yeah, that's where we are at on time.

>> I think we can have time for one more question.

>> I know you mentioned that you weren't planning to monetize this model, but obviously that's required eventually for bringing value back to the platform. How are you planning to do that? I mean, these models aren't exactly commodities. What is your thinking there on how to do that?

>> In terms of how to monetize?

>> Yeah. Like, is it just serving inference for the model?

Yeah. From an end-user perspective, interacting with one of these, our goal is to make it exactly the same as calling an OpenAI API: there is some kind of API call with a payment attached, that payment is going into the API, the API is calling the actual protocol model, and then that payment flows directly back to the ownership graph that's being constructed there. And like I said, I think there are details on exactly how you construct the graph, like how much you give to compute, how much to data, how much to designers, but that's the overall goal.

>> Right. And then are there any kinds of market dynamics in place there, like the kind of model, the specialties of the model, and that'll determine the value?

>> I think these are more business questions that I maybe haven't answered publicly, but I can give my general thoughts. My goal here is: I generally think you can produce the best models both through unlocking scale and through actual open innovation. I think the path to that is a long path, and part of what I view Pluralis's goal as is going and actually doing those intermediate first models, for relatively specific use cases where we know there's demand, and bringing the first inference revenue into the protocol manually. I think that's our job. And we'll do that until we can begin to actually have this spinning on its own, where people are coming and proposing models because they know there is inference demand, and attracting it that way. The other nice thing here is that every single person on that graph would be incentivized to drive use to the model, because that is how they're going to make money.

>> Okay. Thank you, Alex, very much. That was a really great discussion.
