
Persistence Squared: Persisting Persistent Data Structures - Juan Pedro Bolivar Puente - CppCon 2025

By CppCon

Summary

Topics Covered

  • Structural Sharing Enables Efficient Immutability
  • Immutable Data Structures Enable Value Semantics at Scale
  • Copy by Value Eliminates Synchronization Overhead
  • Persistent Undo History as Simple Document Collection
  • Pools Preserve Structural Sharing on Disk

Full Transcript

All right, welcome everyone. Excuse me for the technical difficulties. Welcome to the presentation Persistence Squared. And if this works, [clears throat] so, my name is Juan. I'm the CTO of a music tech startup called Bronze. I'm an

also an independent consultant. I've

been working for many years in the fields of music, graphics, films, robotics, real time. And it's precisely because of this involvement in highly

interactive and dynamic systems that I got very interested in persistent data structures. So what is a persistent data structure? When we hear the word persistent in the context of data structures, we have to forget our intuition that persistent means something like persistent storage,

right? Like somewhere you save information so that it survives rebooting the machine. When we

talk about persistent data structures, we mean data structures where the values of the data structure persist after modification. This means that you can change the data structure, but whatever shape it had before is still there for you to access in some way.

One very common way to achieve this is through immutable data structures. We achieve this property in an almost tautological way, because if a value is immutable, it obviously persists, right? You cannot modify it.

But some of you may be thinking that I should cut down on the microdosing or something. [laughter] Immutable data structures? The whole point of a data structure is that it mutates, right? It changes. You put stuff in, you get stuff out.

Hang on for a second. I'm going to show you the library immer. This is a library of immutable data structures that I've presented at CppCon a few times before; I think the first presentation was in 2017. And, not to brag, but it's the de facto standard for immutable data structures for C++ developers. And actually it's very production ready.

It is used in production systems, even critical systems. At Meta, for example, we had a project where we really used it in super core infrastructure. I know it's used inside Google and Amazon, and recently I learned that MongoDB uses it in places as well. So it's very solid, it's very well tested, and you're welcome to use it if you like what's presented here.

So once we have an immer vector, we can declare a vector that is const, right? It's immutable, so we can mark it const. We can still call the push_back method of the vector, but since we cannot modify

the previous vector, what we get is a new version of the vector. So we had a vector a, and now we have a vector b, and because the value a persisted, we can do comparisons and say: hey, before we had three items, now we have four items. And both values are there for us to poke around.

We can even do it again. If we want to modify an element, we cannot use brackets, because operator[] cannot return a new value the way you would expect. But we can use the set method and get a new vector with some value changed. How would we implement this? Well, if you're a

C++23 developer, you may try something like this, where the class immer::vector contains a std::vector inside, and we just write wrapper methods like this one. And here

we're showing off one of the very nice, and for me most wanted, C++23 features, which is explicit object parameters, deducing this. What we're doing here is explicitly declaring the this argument as part of the argument list, and in this way we can take it by value, which is something you could not do even in C++20. There are other fancy things you can do with this, like deducing the type and using a universal reference; here we just want to take the vector by value. When I turn my head, can you still hear properly? I'm not sure. Yeah. So we take it by value so that we can push back into the member vector. But this is a copy, right? Since we received it by value, it is already a copy inside this

function. And we can return a new version of our data. We can do this with the set method as well, and we're done, right? Immutable data structures, the talk is ready. But of course you will think this wouldn't work in practice. The reason is that a standard vector

stores all the data in one contiguous buffer, so whenever we need to change it in this way, if we make a copy of the vector, we're going to have to copy all the data. There's no way to modify just this little bit here without copying the rest. So if we use this data structure a lot, you're going to get quadratic behaviors and lots of copies everywhere. You're C++ developers; you are probably very aware of why you shouldn't do something like this.

So we're going to need a different representation for our vectors. We're

going to use something like this: a tree, and a tree that is very regular. It's perfectly balanced, as we call it here. This is a radix balanced tree; I will talk a little bit more about it later. What's interesting is that all the data is stored in chunks that have a fixed size that is a power of two. Here, for presentation purposes, I'm using blocks of four elements; in practice we actually use wider blocks. I'll talk about that later. Now the nice thing is that we have smaller chunks of data that are related to each other.

If we want to modify that vector v0 that we have represented in that tree, I don't need to copy all the data. I'm going to use this very regular structure to first address directly, without doing any search, the element that I want, here the element at index 9. Because I'm using four-element blocks, and the base-2 logarithm of four is two, we can group the bits that represent the index of the vector into groups of two. And I'm going to use the first group of two to index into the root here. So we're moving through zero here.

Then we go through two here, as in C++ index two, the third element. And here the second element, j. So here's our j. And

now we can copy that chunk, and of course we're going to need a copy of its parents. But we didn't need to copy all the data, right? And this is how we manage to modify the data structure. Modify meaning: create a new version, a changed version that shares a lot of structure with the original, without changing it. And this is the key phrase that you need to remember from this talk: structural sharing. This

is the property of immutable data structures that makes them work in a really nice way and gives us really interesting properties. We have all these values that are, from the programmer's point of view, purely independent. They're value semantic, completely independent from the interface point of view. But under the hood, they share structure. And we achieve this through this tree structure plus reference counting, which you need in order to track when this memory can be deallocated.

So this data structure, as I mentioned, is called the radix balanced tree, and it's characterized by a branching factor M that is a power of two; as I said, these blocks need to be very regular. In the slides I was using a branching factor of four; in practice I use a branching factor of 32, so the constant B that you normally use to define the data structure is five, since 2^5 = 32. And it is because of this that a lot of the operations on this data structure are what we call effectively O(1), and this is a

little bit controversial; I heard some criticism in the past because of this. I didn't come up with this term. It was coined by the person who made these data structures popular, Rich Hickey. Rich Hickey is the inventor of the language Clojure, and he made a very bold decision: it's the first programming language that decided to use this data structure as the default data structure to represent vectors in the language.

So if you're using Clojure and you use a vector, you're getting this data structure. And he argued that this is a good representation for a vector because, in practice, random access in this structure is effectively O(1). If you know a little bit about algorithms, you will say: well, technically it's logarithmic. But the key thing is that the base of the logarithm is 32. What this means is that with just five levels you can already address over 33 million elements.

With seven or eight levels you can address more elements than you can ever fit in any computer that we have now, or any computer that you can probably imagine in the next few decades. So the maximum depth you can reach is like seven or eight in the really worst case; in the most common case it's going to be three or four. Is that number a constant [laughter] or is that number the result of your asymptotic big-O? So anyway, you can forget about this effectively O(1) and say that it's logarithmic, but I think it's an interesting way to think about it. The other cool thing about this data structure is that it's very cache-efficient, right?

Because we're storing the data in contiguous chunks, and these chunks are actually wider than cache lines; you can fit a few cache lines in one of these leaf nodes. So when you iterate over this data structure in the right way, which I will show later, the iteration time is actually comparable to a std::vector; there's a very small difference. On top of that, the actual data structure

that is implemented in immer is not the RB tree but the relaxed radix balanced tree, the RRB tree, and this gives us logarithmic-time concatenation, in this case real logarithmic time, not effectively O(1). And this is really cool, because it opens the door for things that you cannot do with a standard vector, like inserting at any position of the vector

in logarithmic time. It also opens the door for all kinds of new parallel algorithms, because, for example, a parallel filter needs to produce vectors of different sizes, so you cannot really push into the same vector from different threads without some kind of locking. With this, you can just distribute the work: each of the threads produces its own filtered vector on its subset of the work, and then you concatenate the results and get a new vector.

This is quite interesting. Another interesting property is that growing this vector is not amortized O(1); it's actually effectively O(1). [laughter]

And this is really nice because it means that you get a very stable latency, and for some use cases this is a quite nice property. So yeah, this is a

cool data structure that, beyond the immutability aspect, has other interesting properties as well. Immer also provides hash array mapped tries. We showed a wide-node immutable data structure for contiguous sequences; this is the equivalent for unordered maps and unordered sets. And it's quite similar; I'm not going to go through all the differences. The important thing is that it's also characterized by a branching factor of 32. The tree is a little bit less

32. Um, the tree is a little bit less regular because it depends on the prefixes of the hashes, but you also get very um uh very wide nodes with uh

pretty decent um cash locality. actually

better cache locality than with than with a standard and order map because in a standard order map you actually have a linked list uh on every element of the bucket.

Again, this data structure doesn't require rehashing. So as you grow it and add new elements to it, you're never going to have a big stop-the-world step, which means you get a constant, or not constant but predictable, latency on all operations. It's also

very compact. Actually, in the project that I mentioned before at Meta, we were replacing some unordered maps that were synchronized with locks, and we were interested in these data structures because of the concurrency benefits that they give. But we actually saw afterwards that we got a 50% decrease in memory usage because of moving to this data structure.

This is pretty cool, and the nice thing about this kind of data structure is that it opens the door for all kinds of new architectures.

It's a kind of architecture that I like to call value semantics at scale, right?

Because we often apply value semantics in little places here and there in our application, but it's difficult to move the big values, the big documents, the big things in our application to pure value semantics.

And value semantics at scale is characterized by, I would say, three major advantages. The first one is concurrent programming.

This one is the most obvious one. Let's say we have a document, and this document is using data structures of some sort. If we're using mutable data structures, we're going to have one version of the

document that everybody in our application is trying to access: the UI is triggering actions that modify it, and at some point you want to save it to disk. Well, if everybody's accessing this, and we want to do these operations concurrently, we're going to have to use some kind of synchronization mechanism. In this case, I just put a mutex here, but this means that this operation is not actually fully concurrent. So if the saving to disk takes a while and the user tries to modify the document, they're going to get a spinning wheel. It's not so nice.

Plus all the problems that mutexes have; they're a little bit of an error-prone technique. If you use immutable data structures, because immutable data structures share structure, copying an immutable data structure is basically as cheap as copying a pointer and maybe touching a reference count.

You can just capture the whole document by value in your lambda and save it to disk, and you don't need any synchronization. That's pretty neat, I would say.

But another advantage of immutable data structures that I think is sometimes underappreciated is reasoning about change. What do I mean by this? In a lot of interactive systems, the systems that I often work on, you have some kind of document, some kind of data model that the user manipulates. So you have actions that change it. And a thing that you need to do very often is to figure out what changed: I did something to the document, what changed? In this way we can do updates in other systems; for example, the UI needs to represent that change to the user. If you're making music

software, like I often do, you need to represent that change in the audio graph or whatever is generating the sounds. So if you're using a traditional mutable system, well, you have the document and the document is mutating all the time, but the previous values are not persisting. And if the previous values don't persist, you don't have the information, or even a way to recover the information, about what changed. You change something, and that change is described in the code, but it's not described in the data that you have. And it is because of this

that we resort to techniques like callbacks, where, as you go modifying things, the code explicitly says: well, I just changed this thing, I just changed the property foo of the document, or I just changed the property bar of the document, and someone else is listening to this and trying to make sense of it. This means that when you change the document, you probably need some kind of helper function that hides the notification.

But for me, the biggest problem with this technique, there are a few other problems, but the biggest one, is that it doesn't compose. So now, if I have a complex operation that needs to manipulate multiple properties of the document at once, well, I can do set_foo, set_bar.

And you will say, well, no problem.

Well, the problem is that the user, the user meaning the clients of this API, sees all the intermediate updates. They may even see broken invariants. Let's say we have the invariant that foo has to be less than bar; we broke the invariant when we did that first set_foo. And the UI may break, or now has to consider the possibility that the invariant is looser. Basically, you have to loosen your invariants in some way. And as your application gets more and more complex, this leads to a lot of complexity,

wasted updates, and often callback loops and the spaghetti-code monster that takes over our code.

When you're using immutable data structures, value semantics do compose. First, I would argue that you often don't even need this kind of accessor method; well, here I'm not even using an immutable data structure, I'm just using integers, but you can just manipulate the value, because you have your own copy here inside

the function. You took it by value. You can do whatever you want here; nobody's going to see it. And then you produce a new value, and that's what other people may see. So from the point of view of the users of this function, this is like an atomic operation. It's not atomic in the concurrency sense, but it's atomic in the sense that nobody can see what's inside. And you can make other operations that are based on these complex operations, and you can grow your operations more and more complex, forming a tree where they compose,

and nobody has a problem with what's made of what. And because things persist, you always have the previous value if you want it, which means that when you need to figure out what changed, you can just do a comparison: is foo different in the previous value than in the last value? And one key thing here is that comparing vectors, maps, etc. is very cheap, because if they didn't change, the pointer values internally in this tree structure are going to be the same, and if they share structure, the equality operator is going to bypass all those parts that are shared.

So this works very fast. And you may argue that it's not so nice to have to do these comparisons manually. There are other things you can do. There's another library, Lager, that I've implemented; there are other talks about it. There you have these things called cursors, which actually do a lot of this comparing and diffing of the tree for you.

There are other cool things that you can do with the data structures. This one is not implemented for vectors, but it is implemented for maps, sets, and tables, which is another data structure we have, based on the hash array mapped tries: when you have a map and you create a new map that has changed a little bit, you can actually use immer to give you the differences between the two values in a time that is proportional to the amount of change, if they have been derived from one another. So it leverages the structural sharing to implement this in a very efficient way. It can tell you: this thing was added, this thing was removed, and this thing was changed. This is super useful. I use it all the time, particularly when writing

software with UIs that use immer in some way.

And the other fancy thing that you can do with immutable data structures is time travel. What I mean here is: let's say you want to implement undo history. All editing applications these days have that, even though we're losing it a little bit with mobile. But they used to have this, where you can always go back to the previous state of the document after you change something. And

the canonical way to implement this, it's already in the Gang of Four book, is the command pattern, where the undo history has a collection of objects called commands that implement the change you want to do to the document, and they implement it twice: forward, in a method perhaps called do, and backwards, in a method called undo. And then you have all kinds of implementations of this for all your operations.

This is nice in the sense that it gets the job done, but it has some disadvantages. One of them is that you have to implement the operation twice, and you have to make sure that undo is the inverse of do, which may not be so trivial in some cases, or opens the door for bugs. Another

disadvantage is that you cannot jump between different places. So if I want to undo 30 steps, I have to undo them one by one; I cannot just jump to 30 steps ago. Most of the time we just press Ctrl+Z or Ctrl+Y, so you don't experience this, but sometimes it would be nice to be able to just jump in time. Now, if you're using persistent data structures, your undo history can just be a collection of

documents. That's it. Because of structural sharing, you're not going to waste a lot of memory by keeping all the copies of what happened before.

Now we get to a very sad moment, because I had a very cool demo prepared here, but my computer didn't like the HDMI cable or something. So I'm just going to describe what was going to happen. I know the demos are the [laughter] best or most memorable parts of my talks. I wanted to say that the proof is in the pudding, so I wanted to show you a piece of software that is definitely not trivial, that uses all of what I'm talking about. It's not toy software. It's actually software that we're making at Bronze. And at Bronze, we're making a system to make music that is nonlinear and adaptive.

This means that whenever you press play, you can get something slightly different. We don't do this with crazy gen-AI stuff. We do this by allowing the artist, in a piece of software called the Bronze Composer, to create nonlinear musical structures that have space for variation, using probabilities at different levels of abstraction.

That's one side, the Bronze Composer software. The other side is that we're making a new audio format and a very portable audio engine that you can integrate in different places. One of the use cases that we're really exploring now is video games,

which is a very natural use case for this, because you have music that goes on forever and needs to adapt to the environment, and this is all stuff that you can achieve with Bronze. In Bronze, I have a screenshot here at least, when you open a Bronze piece you can see this graph. This graph describes the structure of the music at a high level, because it's composed of sections, where each section has a certain probability of going to another section or not. And these sections can be displayed in the upper part, which you cannot even see in this screenshot, but

what I was saying is that in this upper part you can load the sections and visualize them, play them back. You can even let the engine fill it up for you as you play the song. It will load sections into this kind of workspace using the probabilities that are described in the piece, so you can hear the music as the user will hear it, and you can visualize it and edit it there. And I was going to do this fancy presentation where, using this generative stuff, we

could make a song that is hours long, and you can select and move around thousands of regions and clips from one section to another, and you could take the undo history and hover over all its elements, and as you hover, that state is actually presented to you, and you can even hear it, and everything responds immediately, zero latency anywhere; everything is super fast.

Which is pretty cool. But anyway, you have to take my word for it. Now, this is really cool, but if we just stayed here, there would be no talk. There are some challenges that we need to face, right? Ominous music. We get to part two of this talk, which is persisting persistence,

because what I described before, some of you may have seen already; I've presented it in other talks about immutable data structures. But now we're going to face the challenge of persisting persistence, and we're using the term persisting here in a cheeky way, with a double meaning: we talk about persistent data structures, but we're also talking about persisting the data structures to disk.

Now, for educational purposes, we're going to turn our beautiful radix balanced trees into binary trees. You can actually do this yourself in the library by fixing the B factor, which is accessible as part of the template arguments. This can sometimes be useful to optimize the data structure for your particular use case. We're going to set the B factor to one, which means that we have two to the one, that is, two, elements in each node, and this will simplify the display of the things that we're going to show. Now let's say we want to store our undo history to disk,

because I showed that very simple representation that works really well in memory, and you know why: the structural sharing. But what happens when we want to save it to disk? A lot of software these days actually has this feature where you can load a session and still have the undo history that you had before. Another reason why you would want to do this is to have infinite undo. If you want infinite undo, at some point you may run out of memory, so you want to continue

saving your undo history to disk. Still not technically infinite, but maybe you will die before you can fill up your hard drive with undo steps. [laughter] All right.

So how will we do this? Let's try it first in the, let's say, intuitive way, or the naive way. We have our document, and it's a very simple document here. It's just a document that contains a vector of strings, and we call this vector of strings data. So if we want to save this, we're going to use the cereal library, because it's

convenient; I use it in many projects. And to serialize a struct with cereal, you just have to define this method called serialize, which can work both ways, for saving and loading, where what you're doing is basically describing how you iterate over your structure. It takes an archive as the first argument.

It's templated, because it can be a different thing depending on whether you're doing JSON or XML or binary. And you take your value and you call the archive with your data. In this case, we use this NVP thing; NVP means named value pair. So we give it a name, and if you're using a format like JSON or XML, where you can actually see names in the format, this is the string that it's going to use for that.

Now we also need to save the immer vectors. There is support for std::array and std::vector and all that stuff in cereal, but there is no support for immer, so we can add it ourselves. To do that, we tell it to serialize the size using this special size tag. This is what will make us get an actual array in JSON, for example, as opposed to an object. And then we iterate over every element.

Sorry, this is for loading. We load the size to know how many elements there are. Then we iterate over every element: we load the element and we push it into the immer vector. There's

There's something interesting happening here, which is that I'm moving the vector before calling push_back on it. This will make the function work way faster than if you didn't have the move, because here we don't care to persist the intermediate values of the vector, and if we move it, immer knows that these intermediate values are not visible to anyone and will actually do some of the mutations in place, which is way faster. In fact, filling up immer vectors using this pattern is similarly efficient to filling up std::vectors, with the difference that here you don't sometimes have a big reallocation step; it's a bit more constant. There was a question there.

Where's the index i?

The index i is the number of elements that I'm loading. I load the size, so I know how many items this array, or this vector, has, and then I iterate that number of times.

Yeah.
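To illustrate why the move matters, here is a minimal copy-on-write sketch in plain standard C++. This is not immer's API or implementation, just the idea it exploits: push_back on an rvalue may reuse the buffer, because a moved-from value is not observable by anyone else, while push_back on an lvalue must deep-copy first.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Minimal copy-on-write "persistent" vector sketch (NOT immer's design):
// push_back returns a new value rather than mutating in place.
struct PVec {
    std::shared_ptr<std::vector<std::string>> data =
        std::make_shared<std::vector<std::string>>();
    int* copies = nullptr;  // optional instrumentation: counts deep copies

    PVec push_back(std::string s) && {      // rvalue overload: reuse if unique
        if (data.use_count() != 1) clone();
        data->push_back(std::move(s));
        return std::move(*this);
    }
    PVec push_back(std::string s) const& {  // lvalue overload: always copy
        PVec out = *this;
        out.clone();
        out.data->push_back(std::move(s));
        return out;
    }
    void clone() {
        data = std::make_shared<std::vector<std::string>>(*data);
        if (copies) ++*copies;
    }
};

// Fill n elements with or without the move, returning deep copies made.
inline int fill(int n, bool use_move) {
    int copies = 0;
    PVec v;
    v.copies = &copies;
    for (int i = 0; i < n; ++i)
        v = use_move ? std::move(v).push_back("x") : v.push_back("x");
    return copies;
}
```

With the `v = std::move(v).push_back(...)` pattern no intermediate value is observable, so every step mutates in place and no deep copy ever happens; without the move, every single push_back copies the whole buffer.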

When saving, I need to first save that size; symmetrically, we use the same size tag, and then I use immer's for_each and just give the archive to it. This is also noteworthy: why is there an immer for_each when we already have std::for_each in the standard? Because immer's for_each iterates using internal iteration, I think that's the technical term: it goes inside the data structure rather than using iterators, so it's more efficient, because it traverses the tree directly. If you use iterators, since it's a tree, you need to store the stack of the traversal inside the iterator. It's not much slower, though; often I just use a for loop because I'm lazy, or I don't want the object-size increase that this may introduce. But if you want to be super fast, use immer's for_each.

So now, for illustration purposes, we're going to create three immer vectors of strings; that's the data that we had in the document. If we go back, the document was, or contained, a vector of strings. So we're going to create these three vectors. The first one contains four very simple strings. Then we add two new strings to get a second vector. And then we have a third state that is actually the same. Now, the history is going to be a vector of documents. So we put all of this into a vector, and we serialize it the cereal way, which is: you create the archive and you call the archive with your value, and we get this output. This value0 thing is an artifact of cereal, because it doesn't know the name of the root object that you're giving it. There is a way to get rid of it; we don't need to go through that.

This works. It's nice. But there is a problem: these values are independent. Let's think about it again. The three values that we had can be represented very compactly with this tree. We had V1, which has two leaf nodes, because it's a binary tree with two elements each. Then we had V2, which had two extra elements, so it adds a root and a new leaf. It doesn't waste any space; it only allocates the new data that is necessary. And we had V3, which is actually identical to V2.
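The sharing in that picture can be sketched with plain std::shared_ptr nodes, a big simplification of immer's actual radix-balanced trees, but enough to show that V2 reuses V1's leaves and V3 is literally the same tree as V2:

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// A node is either a leaf holding elements or an inner node with children.
struct Node {
    std::vector<std::string> elems;               // filled in leaves
    std::vector<std::shared_ptr<Node>> children;  // filled in inner nodes
};
using NodePtr = std::shared_ptr<Node>;

struct History { NodePtr v1, v2, v3; };

inline History make_history() {
    auto ab = std::make_shared<Node>(Node{{"a", "b"}, {}});
    auto cd = std::make_shared<Node>(Node{{"c", "d"}, {}});
    // V1: a root over two leaves.
    auto v1 = std::make_shared<Node>(Node{{}, {ab, cd}});
    // V2: adds a new leaf; the old leaves are reused, not copied.
    auto ef = std::make_shared<Node>(Node{{"e", "f"}, {}});
    auto v2 = std::make_shared<Node>(Node{{}, {ab, cd, ef}});
    // V3: identical to V2, so it is the very same root.
    auto v3 = v2;
    return {v1, v2, v3};
}
```

Only the new leaf and the new root are allocated for V2; everything else is pointer-shared.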

Instead, when we save, we get three completely independent vectors, because cereal doesn't know about the internal structure of immer. And this means that when we load, we're going to get three different trees. We lost this beautiful structural sharing, and if we did this with a very long undo history, this is potentially going to blow up our memory. This is not fun. Actually, the file itself will already be potentially really big, because you will have all these arrays with duplicated information, where A B C D is listed three times in the JSON file, when in memory each letter appears only once. So this is not nice. Can we do better?

Well, the way we're going to do better is by introducing the notion of a pool, because the problem we have is that each vector is treated completely independently: cereal doesn't know that some vectors share structure with others. So we are going to introduce this notion of a pool of vectors. If we have a pool of vectors, maybe we can teach cereal to look into what sharing happens between the vectors within the pool, and actually serialize, in some kind of JSON representation, the actual internal structure of this pool. Then we can load the pool as a whole, and out of this pool we can reconstruct the vectors with the shared structure that we wanted.
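The essence of the pool idea can be sketched standalone. The format below is made up for illustration and is not immer-persist's actual JSON: intern every node by identity, write its contents once, and serialize each document as a list of node IDs.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// A leaf node, shared by identity between documents.
using Leaf = std::shared_ptr<std::vector<std::string>>;

struct Pool {
    std::map<const void*, int> ids;  // node identity -> pool ID
    std::vector<Leaf> nodes;         // unique nodes, stored once each

    int intern(const Leaf& n) {      // returns the existing ID for shared nodes
        auto [it, inserted] = ids.emplace(n.get(), (int)nodes.size());
        if (inserted) nodes.push_back(n);
        return it->second;
    }
};

// Serialize a history of documents (each a list of leaves) as ID references.
inline std::string save(const std::vector<std::vector<Leaf>>& history,
                        Pool& pool) {
    std::ostringstream out;
    for (auto& doc : history) {
        out << "[";
        for (auto& leaf : doc) out << " #" << pool.intern(leaf);
        out << " ]";
    }
    return out.str();
}
```

A history of three documents sharing leaves then stores each leaf's contents once in the pool, while the documents themselves are just tiny ID lists; loading can rebuild the shared pointers from the IDs.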

This is pretty nice, and this is what the new library that I'm presenting today provides. It's called immer-persist, and it's in the immer distribution, on master. It's a little bit experimental, so I will mention some of the caveats at the end, but it's usable, it works, and all the code that I'm presenting today is actually runnable code. So how can we use immer-persist for our documents? The first thing we're going to do is get rid of that serialize function and replace it with BOOST_HANA_ADAPT_STRUCT. Now, if you're using C++26 reflection, of course, this is not needed. If you're on C++20 you can even use Boost.PFR to avoid it. In some cases I'm a little bit old school, so I use Boost.Hana. Maybe I could even use Boost.Fusion's adapt struct, actually. Michael and I were saying yesterday that we miss the old-school template metaprogramming stuff. This morning I actually found a very good explanation for why that's happening in our brains; if you're curious, ask about it in the questions. It's maybe a bit too much for now.

What BOOST_HANA_ADAPT_STRUCT does is inform Boost.Hana, which is a metaprogramming library that gives you compile-time data structures, about the struct: it tells it which members there are and all of that. Then we can get this information at compile time and implement the serialize function in a generic way. The serialize now doesn't have to be customized for a document. We can say: for every Boost.Hana struct, to serialize it, you iterate over the members using Hana's for_each, you get the keys, which are the identifiers of the members of the struct, and then you use cereal's NVP with what's on that key.

Now, once we have this, we can go back to our previous example. We have the vector of strings, we push those two new elements, and we have the third copy. The data is exactly the same; the history is exactly the same. But to save it, we're going to use cereal save-with-pools, which is a function from immer-persist that leverages cereal to serialize all the data structures that appear in that document tree using pools. What's cool about it is that you don't even need to specify what the pools are: it will figure them out, combining all the vectors that have the same type. And then there is this policy thing, the hana-struct auto-member-name policy, which means: use Boost.Hana precisely to figure out which pools there are, give them some names automatically, and just get it working.

So if we do this and serialize, we get this beautiful, maybe not so beautiful, JSON. It's definitely less human-readable, but what's nice about it is that we do have the structural sharing in it. If we look at value0, that's the actual history, an array of three documents. And what you can see is that the vectors contained in the documents are not referenced directly: we don't see an array expanded there, each just gets an ID. And you can already notice that vector 2 and vector 3, which are the same vector, have the same ID.

Then we have the pools, which contain all the pools that are needed for our document. Since our document only had one data structure, which was called data, it gave the pool the name data. This is sometimes convenient; sometimes it's a little bit confusing, because if you actually have vectors of the same type in different parts of your document, it picks up the name of the first attribute that uses this vector type. But you can customize this; don't get too hung up on it. It also serializes the template parameters, because a vector with one value of those parameters is not the same as a vector with another, and if you mix them up, the structure will be different: you could mess up the loading. So this allows us to check, when loading, that we have the right format.

Now we have the leaf nodes, and you can see there's no duplication of information: we only have the three leaf nodes that we saw in our picture before. Then the vectors: there are only two, because V2 and V3 are the same. And then there are two inner nodes. This is actually slightly different from the picture, because there is this tail thing here; watch the other immutable-data-structures talk if you want to know more about it, or I can explain in the questions if we have time. It's an optimization in the data structures that changes the shape of the tree a little bit, but for the most part it's the same as in the picture. And this is nice: we now have structural sharing on disk.

Here is a little overview of what happens in this policy thing. The policy does some of the magic that allowed it to be so automatic, but you can also do it manually, depending on your application. First you say what the pool types are; this is a Boost.Hana collection, a compile-time thing, where you say: I want a pool for the vectors of strings, I want a pool for this and that. And maybe you want to skip some data structures, because of the way they're used in your application the structural sharing is not relevant; you can skip them there. Then you can give the pools names, and this is the name that you're going to see in the JSON; we saw we called it data there, but we could call it something else. And then you can also customize the save and load for the root value, so you don't get that value0 thing; you can give it a different name, for example history.

Okay, really nice. So we managed to get persisting to disk with this pool concept. Can we do something else that is kind of cool? Well, saving to disk is a transformation of the pool: you have a memory representation of the pool and we transform it into a JSON representation.

We could do any arbitrary transformation on these pools, and this is actually useful; it solves a real problem. Let's say you want to transform the strings of all these vectors that you have. You can have a transform function like this that uses std::transform and uses this trick of getting a transient. The transient just gives you a mutable API, instead of the immutable API, that is compatible with the standard library, push_back and things like that. So we transform the vector and get new vectors using a function; for example, we convert the three vectors that we had to uppercase.

Now, if we do this, we have the same problem that we had with serialization: the transform function doesn't see the internal structure of the vectors. It's going to iterate over every element of every vector and, very straightforwardly, transform every item, and these things that were interdependent in their representation before suddenly become independent. We waste memory. We also waste work, because this transform function is transforming elements that it could have transformed only once. But using pools, we can apply a transformation over the pool and get a transformed pool where the change has been applied to each element only once. There is a very important use case for this in Bronze, which is actually what I'm going to show here: how we deal with versioning of documents.
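Before getting to versioning, the pool-level transform can be sketched standalone. This is not the immer-persist API, just the underlying idea: memoize the transformation per unique node, so shared nodes are transformed once and stay shared in the result.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Leaves are shared by identity, as in the pool sketches above.
using Leaf = std::shared_ptr<std::vector<std::string>>;
using Memo = std::map<const void*, Leaf>;

// Uppercase one leaf, reusing the memoized result for already-seen nodes.
inline Leaf transform_leaf(const Leaf& in, Memo& memo, int& applications) {
    auto it = memo.find(in.get());
    if (it != memo.end()) return it->second;  // shared node: reuse the result
    ++applications;                           // work done once per unique node
    auto out = std::make_shared<std::vector<std::string>>();
    for (auto s : *in) {
        for (auto& c : s) c = (char)std::toupper((unsigned char)c);
        out->push_back(s);
    }
    return memo[in.get()] = out;
}
```

Running this over a whole history transforms each unique leaf once, however many documents reference it, and the transformed documents end up sharing the transformed leaves exactly as the inputs shared the originals.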

There is no perfect way to deal with versioning, and all serialization libraries try something different. Very often you end up, in your serialization code, reading a version string and doing one thing or another depending on which version it was. I don't like that because it's error-prone: if you mess up how you're interpreting the old version, you may not notice. What we do in Bronze is that for every version of the document we just have C++ types that describe the schema in C++, because we're using cereal or Boost.Hana for serializing them. And if we introduce a new version of the document, for example instead of a vector of strings it's going to be a vector of chars, we just copy-paste the things that change, make the change, and then we have code that leverages Boost.Hana to do most of the conversion automatically. It traverses the tree automatically, and only for the things that actually changed do you need to write your little conversion function that upgrades. If you forget something, you're going to get a compilation error. So there's kind of no way to mess up the versioning, unless you modified an old version header, which you should never do; in a code review it's very easy to see: hey, you modified the V1 header, why did you do that? No, you have to create a V2 header, copy, paste, and make the change. So I like this, but it has the problem that we are transforming between versions of the document, and we're destroying the structural sharing if we use the naive approach.
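The versioning scheme can be sketched in plain C++ with simplified types; the real documents are Boost.Hana structs, and the member names here are invented. Each schema version is its own type, and upgrading is one explicit conversion function per change, so changing a member's type without fixing that function is a compile error rather than a silent misread of old files:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each document version lives in its own namespace (in practice, its own
// header that is never edited after release).
namespace v1 {
struct Document { std::vector<std::string> data; };
}
namespace v2 {
struct Document { std::vector<char> data; };  // strings became chars
}

// The one explicit upgrade step; everything unchanged would be converted
// automatically in the real, Hana-driven version.
inline v2::Document upgrade(const v1::Document& old) {
    v2::Document doc;
    for (auto& s : old.data)
        doc.data.push_back(s.empty() ? '\0' : s[0]);  // first char, as in the talk
    return doc;
}
```

The transform-through-pools machinery then lets this same per-value upgrade run once per unique node instead of once per reference.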

So how do we achieve this? We're going to write a transformation that describes what we're going to do on all the pools of the document. For this we have a Boost.Hana map that contains an entry for each type of data structure that I have in the document.

Here, for the vector of strings, I'm going to convert the values. That's the key thing we're doing: a transformation on the values of the vector, and that's why we can preserve the structural sharing. If you were doing something else, like a filter, you cannot preserve the structure, because the structure will be different. But if you're just transforming the values, you can preserve the structure, and you describe the transformation in this way. Here we're doing a very silly transformation that doesn't really make sense: we're just taking the first character of each string. We said, we were using A B C D; why not just represent each string with its first character? If there was no character, for safety, we just use a null character. A very silly transformation, but it highlights what we want to do.

So that's the type of the vector, and this is the transformation. You could have many of these in a practical application, where you have a complex document with lots of different pools. Now we have again the same example, with our three vectors and our history that contains them. How do we transform this history?

Well, first we're going to say: give me the output pools. We distinguish between input and output pools. Output pools are pools that you can get values from, and we need to get the values out to transform them, so this is an output pool. Now we say: transform the pool using the xform, the transformation that I just gave you. With this, we get another pool, P1.

And now we can transform the history itself. For the history there's no structural sharing; there is only one history. We're going to transform it using, in this case, std::ranges, just to be a little bit fancy. What we have as an argument is a version-one document, and we want to return a version-two document, and we do that by converting the container using the two pools from before. You actually need to pass both pools because the whole thing happens lazily: when you ask the transformed pool for a node, it looks up in the previous pool whether it has been transformed yet or not. You pass in the container that you want to transform, and you get the transformed container. So data in the argument is the vector of strings; the data that you have in the variable is the vector of chars that you can return; and finally we get the new, updated history.

And if I serialize to disk, just to show the internal structure, you can see that it's basically identical to... oh, sorry, I was waiting for one. [laughter] You can see that it's identical to the previous history. The only thing that's different is that the leaf nodes don't contain strings anymore. Now they contain chars, and the chars are serialized as the character codes.

So yeah, this is pretty cool, I think, and it gives us lots of things. There are more things I can say about the library; I'm going to go through them very briefly. I showed how to transform vectors; the hash tables and sets also work. You have to be a little bit careful, though, because if you're saving to disk, you need to make sure that the hashing function you're using is portable. You cannot just use std::hash and get a different result on Windows or Mac, or even after upgrading the compiler, and then your documents stop loading. The hashing function has to be stable. So we provide a wrapper for xxHash, which we found to be kind of nice, but you could use something else.
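What "stable" means can be illustrated with 32-bit FNV-1a. This is just an example of a portable hash, not what the talk's wrapper uses (that is xxHash): the function is fully specified by two constants, so it yields the same value on every platform, compiler, and standard library, unlike std::hash, whose result is implementation-defined.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// 32-bit FNV-1a: completely determined by the offset basis and the prime,
// so a key hashed today loads correctly on any machine, years from now.
constexpr std::uint32_t fnv1a(std::string_view s) {
    std::uint32_t h = 2166136261u;  // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;             // FNV prime
    }
    return h;
}

// Pinned forever: these are the published FNV-1a test vectors.
static_assert(fnv1a("") == 0x811c9dc5u);
static_assert(fnv1a("a") == 0xe40c292cu);
```

The static_asserts are the whole point: you can commit the expected hash values to your test suite and any platform where they change fails to build, instead of silently failing to load documents.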

You can also convert the key types. So if you have a map, you can actually convert the key type, as long as the hash is the same, or as long as the keys are equivalent. This is useful to have. We use it in our document because we often have type-safe keys for tracks and regions and sections, and when we change versions the key type changes, but the actual information underneath is equivalent. So it's an important feature to have, but you have to use it a bit carefully; when you try to convert a key type, by default it's not going to let you, and you have to tell it: please, I really want to convert this key type, it's safe.

Also noteworthy: complex data models work. As I mentioned, if you have lots of data structures in your tree, all these functions are going to collect them and pile them up into pools, per type, which is really nice. You can have as much sharing in the structure as you want, in different parts of the tree. And one key thing that was very important for us is that recursive data models work. This was actually quite tricky to get working, because you can imagine, when you're loading something recursive and the information is in the pool, it's not trivial at all. But it was important for us because we're working on a new feature where you have tracks with clips, and you can have an arbitrary timeline in one of these clips that expands to other tracks with clips, and you can have more tracks with clips in any of those, and this can grow recursively, which is very expressive and pretty nice, but it was a bit of a headache for this feature. It's actually the original motivation we had to implement immer-persist: we were not supporting persisting the undo history, but we wanted that feature, and with it, it's very easy to create a very complex clip in the timeline, duplicate it around to create loops, and move it around, and for all these duplicates you want to preserve the structure. So this was very important for us.

Finally, as I mentioned, the library works, but there are definitely some to-do items. The first one is incremental processing. At the moment we save the collection, or the pool, in one step. For the undo-history case, for example, it would be really nice to be able to save just the differences introduced into the pool by each new document, so you save the undo steps incrementally, as they come in. We have some ideas on how to implement this, but we haven't done it yet. I think binary formats are not working yet; it would be nice to get them working for use cases where you want a more compact representation. That's also something we can work on: the JSON you saw before was quite wordy, right? You're sharing structure, but you actually have way more characters. We can make it more compact, but I thought that for the slides it was nice to have some things spelled out. Then, cereal independence: right now we depend on the cereal library, and I think we don't fundamentally need to. It would be really nice to have some kind of hook so you can use your own serialization library. And build time and object size: when you have lots and lots of pools, particularly in our case where every version of the document has its own set of pools, build time and object size get pretty big. I think there are definitely ways to improve on this, because part of the problem is that it's all templated, so for every pool you get new serialization code for everything. But, for example, the internal nodes of the data structure are the same regardless of the element type, so all of this could be type-erased, and using other techniques I think we can definitely improve both build time and object size.

Also, it's very important to thank Alex Shabing. He actually did basically 99% of the implementation of what I presented today. Sadly, he couldn't come to CppCon, so I presented it myself. He's an excellent developer; he has been working with Bronze, and he's also an independent contractor. If you want to hire him for anything, I can definitely vouch for him; he's a really brilliant developer.

Now, key takeaways. Structural sharing is a very important property of immutable data structures; they enable completely new architectures in our software. And with immer-persist you can leverage structural sharing across sessions of your application by storing it to disk, and you can transform the structures while preserving the structural sharing. I think this has value in lots of applications; it was a key thing we were missing to really be able to use this value semantics at scale, at really big scale.

I will also say: help us. Bronze is a very small startup which sponsored most of the development of this persist work, but if you find this useful for your work, please hire Alex, or any of us, to continue improving on it. It's an ongoing effort. There are other companies that have contributed to immer in different ways, and it's beautiful that with open source we can all contribute a little bit, either financially or with code. If you want to try implementing some of the things that I've described, or other things that you need for your own use case, please be my guest.

Here's a list of previous talks on this topic. I'm not going to go through them now because we're running out of time; I'll just leave it up, since some people were taking pictures of it. Thank you all for listening, and if you have any questions, it's the time to ask. Thank you. [applause]

Hello, very nice talk, thank you. I have two questions. First, you mentioned that the pools are detected automatically. What does that involve? Is it a costly operation to detect the pools?

Well, it happens at compile time; that's why it uses Boost.Hana. And since we had Boost.Hana anyway, we could get rid of the cereal code; we use Boost.Hana to traverse your document, figure out all the types, and combine them into these pools. So there's no runtime payment here. There's some payment in compile times which, depending on your document, may or may not be relevant.

But let's say you have three immer vectors, like in your example; the library has to detect that they share information.

Oh, you mean that. That happens as you serialize.

Uh-huh. So there's bookkeeping in the serialization?

Yeah, that's part of the serialization. Well, it's part of creating the pool that is then serialized.

It's not part of cereal?

It's not part of cereal, exactly. It's part of immer, or immer-persist in particular. As it's trying to serialize things, it finds the commonalities, and so on. But the cost is proportional to the amount of stuff that you're actually writing to disk, to the amount of change you have, to the size of the actual final pool. So there's no secret quadratic crazy thing in there.

Second, short question: do these immer vectors support iterators? Normal iterators that can be iterated normally?

Yes. I showed immer's for_each, which can give you a bit of an extra performance boost where you need it, but you can also use normal iterators, and I use them all the time.

Is there any std algorithm that you cannot use at all, because it's so expensive, or do all of them work?

There are algorithms that you cannot use at all, because they're not even compatible, and ones that I don't recommend applying directly. One key example is sorting. When you sort a vector, you need to go through the whole vector anyway, you're producing a completely independent one, and you need to do a lot of random access. If you do this on an immer vector, you're going to be paying that price. So what I do in practice is just copy to a std::vector, sort, and then convert back to an immer vector.
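That workaround can be sketched generically; immer itself is not needed to compile this illustration, and the rebuild step mentioned in the comment is how it would look in real code:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Copy the range into a std::vector, sort there with cheap random access
// and in-place swaps, and return the result. From the returned vector a
// persistent vector can be rebuilt in one pass -- e.g. something like
// immer::vector<T>{tmp.begin(), tmp.end()} when immer is in play.
template <class Range>
auto sorted_copy(const Range& r) {
    std::vector<typename Range::value_type> tmp(r.begin(), r.end());
    std::sort(tmp.begin(), tmp.end());
    return tmp;
}
```

This keeps the O(n log n) comparison count but does all the element shuffling in a flat mutable buffer, instead of paying persistent-structure costs on every swap.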

So the category that you gave to the iterator is not random access? You put something like forward?

It's random access, but it's not mutable. That's why you can't sort with it.

Oh, that's what prevents the sorting. Okay, thank you.

Yeah.

Hi, thank you so much for the talk, and more than that, in our team we are actually using immer extensively.

Oh!

It's a building block in a very large-scale application. I have a question: say we use it in a composite way, immer of immer as a value type, all the way down, and somehow the value type has to be a pointer. What would be your recommendation? How do we deal with that?

You mean in the context of immer-persist?

Yeah, in persist, and in the transient. If the value type is a pointer, then when we get a value and try to make changes, we want to make a true value copy instead of a shallow copy of the pointer.

I'm not sure I'm fully understanding like like if you're using emir, you're going to have a vector of vectors you're saying for example, right?

Like from the point of view of your code, it's a vector of vectors. There's

no pointer is the the pools that going to actually Yeah. The value yeah the value itself

Yeah. The value yeah the value itself it's not like a vector of vector but the vector of a pointer to a vector something like that okay

I see. And you want to know how to translate those pointers that you're using?

Yeah. Once we get to, say, the transition, at that moment we actually want to enforce a true value copy.

I see. I guess, if you're using this pool thing, you will need to translate those pointers, which is what the thing is doing under the hood, right? It's translating the pointers to those numbers and then getting them back. So I imagine there will probably be a way to do this. I don't have it on the top of my head now, and I'm not sure you can actually get it out of the API that we have today. But I think conceptually it's something that could be added, right? Because that's part of what the thing is already doing: it's translating.

Yeah. Right now we have our own way to overload it to get away with it. But I just wanted to see if you have any recommendations on what would be best.

I'm not sure I'm fully understanding the problem, actually. Maybe you can show me with code afterwards and we can see if there is something better.

Okay, cool.

Thank you.

Thanks.

Can I ask the last question?

Yes, please.

Um, so I'm very interested in iterator categories in the context of the STL. If you had all the power in the world to adapt the STL to your needs, do you think this involves a new kind of intermediate category for iterators that you could exploit? Like, for example, deque uses segmented iterators, you know, in a hacky way.

Maybe, yeah. I haven't really thought of that. That could be a thing. I must admit that I haven't really prioritized STL compatibility. I was actually considering that we could have a v2 of immer that didn't use mutable APIs... sorry, that didn't use immutable APIs by default, because in the end you can just have a mutable API, which is what the transient does, and use copy-on-write. That's actually the interesting thing: what matters is the actual structure that is underneath, and you can use copy-on-write and get exactly the same benefits. I'm using immutable APIs just because I was coming from the Clojure world, and it kind of encourages a way to think about things that is a bit more value-oriented.

Um, so there are things that we can do in immer to bring it closer to the STL stuff.

I could imagine that there are still things that you would need in the STL. For example, if you're changing the vectors, okay, we're doing copy-on-write, so you're not getting true references to things. It's a little bit like the vector&lt;bool&gt; case, right? So maybe we would need to tweak some of the actual iterator categories. I'm not enough of an STL lawyer to be able to tell you from the top of my head which changes you would need.

But yeah, if someone wanted to have that, it would be an interesting problem to discuss. Sometimes people actually ask me whether this type of container would be something nice to have in the standard, because I use it in all my projects now. But I find the idea of making it make sense with the rest of the standard too daunting, so I haven't even tried. If someone else wants to try, maybe they can do it. But yeah, definitely what you're saying is what you would need to figure out as part of that work: how it makes sense with the rest of the STL and the iterator categories.

Thank you. Thank you so much.

Oh, thank you.

All right. So if there are no other questions, thank you very much, and yeah, see you. [applause]
