Persistence Squared: Persisting Persistent Data Structures - Juan Pedro Bolivar Puente - CppCon 2025
By CppCon
Summary
Topics Covered
- Structural Sharing Enables Efficient Immutability
- Immutable Data Structures Enable Value Semantics at Scale
- Copy by Value Eliminates Synchronization Overhead
- Persistent Undo History as Simple Document Collection
- Pools Preserve Structural Sharing on Disk
Full Transcript
All right, welcome everyone. Excuse me for the technical difficulties. Welcome to the presentation "Persistence Squared." And if this works... [clears throat] So, my name is Juan. I'm the CTO of a music tech startup called Bronze, and I'm also an independent consultant. I've been working for many years in the fields of music, graphics, film, robotics, and real time, and it's precisely because of this involvement in highly interactive and dynamic systems that I got very interested in persistent data structures.

So what is a persistent data structure? When we hear the word persistent in the context of data structures, we have to forget our intuition that persistent means something like persistent storage, right? Like somewhere where you save information so that it survives rebooting the machine. When we talk about persistent data structures, we mean data structures where the values of the data structure persist after modification. This means that you can change the data structure, but whatever you had there before, whatever shape the data structure had before, is still there for you to access in some way.
One way, and a very common way, to achieve this is through immutable data structures. We achieve this property in an almost tautological way, because if a value is immutable, it obviously persists, right? You cannot modify it.

But some of you may be thinking I should cut down on the microdosing or something. [laughter] Immutable data structures? The whole point of a data structure is that it mutates, right? It changes. You put stuff in, you get stuff out.
Hang on for a second. So, I'm going to show you the library immer. This is a library of immutable data structures that I've presented at CppCon a few times before; actually, I think the first presentation was in 2017. And, not to brag, but it's the de facto standard for immutable data structures for C++ developers. It's very production ready. It is used in production systems, even critical systems. At Meta, for example, we had a project where we really used it in super core infrastructure. I know it's used inside Google and Amazon, and recently I learned that MongoDB uses it in places as well. So it's very solid and very well tested, and you're welcome to use it if you like what's presented here.

Once we have an immer vector, we can declare a vector that is const, right? It's immutable, so we can mark it const. And we can call the push_back method of the vector, but since we cannot modify the previous vector, what we're going to get is a new version of the vector. So we had a vector A, and now we have a vector B, and because the value A persisted, we can do comparisons, like say: hey, before we had three items, now we have four items. Both values are there for us to poke around.
We can even do it again, right? If we want to modify an element, we cannot use brackets, because that cannot return a new value the way you would expect. But we can use the set method and get a new vector with some value changed.

How would we implement this? Well, if you're a C++23 developer, you may try something like this, where you say the immer vector class contains a std::vector inside, and we just write wrapper methods like this one. And here we're showing off one of the very nice, and for me most wanted, C++23 features, which is the explicit object parameter, also known as "deducing this". What we're doing here is explicitly declaring the this argument as part of the argument list, and in this way we can take it by value, which is something that you could not do in C++20. There are other fancy things you can do with this, like deducing the type and using a universal reference; here we just want to get the vector by value so that (when I turn my head, can you still hear me properly? I'm not sure. Yeah.) so that we can push back into the member vector. But this is a copy, right? Since we received it by value, the copy is already inside this function, and we can return it as a new version of our data. We can do this with the set method as well, and we're done, right? Immutable data structures,
the talk is ready! But of course, you will think: this wouldn't work in practice. The reason is that a standard vector stores all the data in a contiguous buffer, so whenever we need to change it in this way, if we make a copy of the vector, we have to copy all the data. There's no way to say "just modify this little bit here" while referencing the rest. If we use this data structure a lot, you're going to get quadratic behaviors and lots of copies everywhere. You're C++ developers; you're probably very aware of why you shouldn't do something like this.
So we're going to need a different representation for our vectors. We're going to use a tree, and a tree that is very regular — perfectly balanced, we call it here. This is a radix balanced tree; I will talk a little bit more about it later. What's interesting is that all the data is stored in chunks that have a fixed size that is a power of two. Here, for presentation purposes, I'm using blocks of four elements; in practice we will use wider blocks. I'll talk about that later. Now, the nice thing is that we now have smaller chunks of data that are related to each other.
If we want to modify that vector v0 that we have represented in that tree, I don't need to copy all the data. I'm going to use this very regular structure to first address directly, without doing any search, the element that I want; here it's index 9. Because I'm using four elements per node, and the logarithm base 2 of four is two, we can group the bits that represent the index of the vector in groups of two. I'm going to use the first group of two bits to index into the root: we move through slot zero here. Then we go through two here — two as in C++ indexing, the third element. And then, in the leaf, the second element, J. So here's our J. Now we can copy that chunk, and of course we're going to need a copy of the parents. But we didn't need to copy all the data, and this is how we manage to modify the data structure. Modify meaning create a new version, right?
We create a changed version without changing the original, a version that shares a lot of structure with it. And this is the key phrase that you need to remember from this talk: structural sharing. This is the property of immutable data structures that makes them work in a really nice way, and that gives us really interesting properties. We have all these values that, from the programmer's point of view, are purely independent: they have value semantics; they're completely independent from the interface point of view. But under the hood, they share structure. And we achieve this through this tree structure plus reference counting, which you also need in order to track when this memory can be deallocated.
So this data structure, as I mentioned, is called the radix balanced tree, and it's characterized by a branching factor M that is a power of two, because these blocks need to be very regular. In the slides I was using a branching factor of 4; in practice I use a branching factor of 32, so the constant B, which is actually what you normally use to define the data structure, is 5. And it is because of this that a lot of the operations on this data structure are what we call here "effectively O(1)". This is a little bit controversial; I've heard some criticism in the past because of it. I didn't come up with this term. It was actually coined by the person who made these data structures popular, Rich Hickey. Rich Hickey is the inventor of the language Clojure, and he made a very bold decision: it's the first programming language that decided to use this data structure as the default representation for vectors in the language. So if you're using Clojure and you use a vector, you're getting this data structure. And he argued that this is a good representation for a vector because, in practice, random access in this structure is effectively O(1). If you know a little bit about algorithms, you will say: well, technically it's logarithmic. It is logarithmic, but the key thing is that the base of the logarithm is 32. What this means is that with just five steps you're already at over 33 million elements.
With seven or eight levels you can address more elements than you could ever fit in any computer we have now, or any computer you can probably imagine in the next few decades. So the maximum depth you can reach is like seven or eight in the really worst case; in the most common case it's going to be three or four. Is that number a constant, [laughter] or is that number the result of your asymptotic analysis? So anyway, you can forget about this "effectively O(1)" and say that it's logarithmic, but I think it's an interesting way to think about it. The cool thing about this data structure is also that it's very cache-efficient.
Because we're storing data in contiguous chunks, and these chunks are actually wider than cache lines — you can fit a few cache lines in one of these leaf nodes — when you iterate over this data structure in the right way, which I will show later, the iteration time is actually comparable to a std::vector; there's a very small difference. On top of that, the actual data structure implemented in immer is not the plain radix balanced tree but the relaxed radix balanced tree, and this gives us logarithmic-time — in this case real logarithmic time, not effectively O(1) — concatenation. And this is really cool, because it opens the door for things you cannot do with a standard vector, like inserting in any position of the vector in logarithmic time. It also opens the door for all kinds of new parallel algorithms. For example, think of a parallel filter, where you need to produce vectors of different sizes: you cannot really push into the same vector from different threads without some kind of locking. With this, you can just distribute the work; each of the threads is going to produce its own filtered vector on its subset of the work, and then you concatenate the results and get a new vector. This is quite interesting. Another interesting property is that growing this vector is not amortized O(1); it's actually effectively O(1). [laughter]
And this is really nice because it means you get a very stable latency, and for some use cases that is a quite valuable property. So yeah, this is a cool data structure that, beyond the immutability aspect, also has other interesting properties.

Immer also provides hash array mapped tries. So far we showed a wide-node immutable data structure for contiguous sequences; this is the equivalent for unordered maps and unordered sets. It's quite similar, and I'm not going to go through all the differences. The important thing is that it's also characterized by a branching factor of 32. The tree is a little bit less regular, because it depends on the prefixes of the hashes, but you also get very wide nodes with pretty decent cache locality — actually better cache locality than with a std::unordered_map, because in a standard unordered map you have a linked list hanging off every bucket.
Again, this data structure doesn't require rehashing. So as you grow it and add new elements, you're never going to hit a big stop-the-world step, which means you get predictable latency on all operations. It's also very compact. Actually, in the project I mentioned before with Meta, we were replacing some unordered maps that were synchronized with locks, and we were interested in these data structures because of the concurrency benefits they give. But we actually saw afterwards that we got a 50% decrease in memory usage from moving to this data structure.
This is pretty cool, and the nice thing about this kind of data structure is that it opens the door for all kinds of new architectures. It's a kind of architecture that I like to call value semantics at scale, because we often apply value semantics in little places here and there in our applications, but it's difficult to move the big values, the big documents, the big things in our application to pure value semantics. Value semantics at scale is characterized by, I would say, three major advantages.

The first one is concurrent programming. This is the most obvious one. Let's say we have a document, and this document uses data structures of some sort. If we're using mutable data structures, we're going to have one version of the document that everybody in our application is trying to access: the UI is triggering actions that modify it, and at some point you want to save it to disk. Well, if everybody is accessing this, and we want to do these operations concurrently, we're going to have to use some kind of synchronization mechanism. In this case I just put a mutex here, but this means the operation is not actually fully concurrent. So if saving to disk takes a while and the user tries to modify the document, they're going to get a spinning wheel. It's not so nice. Plus all the problems that mutexes have: they're a bit of an error-prone technique. If you use immutable data structures, then because immutable data structures share structure, copying an immutable data structure is as cheap as basically copying a pointer and maybe touching a reference count. You can just capture the whole document by value in your lambda and save it to disk, and you don't need any synchronization. That's pretty neat, I would say.
But another advantage of immutable data structures that I think is sometimes underappreciated is reasoning about change. What do I mean by this? In a lot of interactive systems, and the systems that I often work on, you have some kind of document, some kind of data model that the user manipulates. So you have actions that change it, and a thing you need to do very often is figure out what changed: I did something to the document — what changed? In this way we can update other systems; for example, the UI needs to represent that change to the user, and if you're making music software like I often do, you need to represent that change in the audio graph or whatever is generating the sounds. If you're using a mutable, traditional system, well, you have the document, and the document is mutating all the time, but the previous values are not persisting. And if the previous values don't persist, you don't have the information, or even a way to recover the information, about what changed. You change something, and that change is described in the code, but it's not described in the data that you have. And it is because of this that we resort to techniques like callbacks, where as you go modifying things, the code explicitly says: well, I just changed this thing; I just changed the property foo of the document, or I just changed the property bar of the document. And someone else is listening to this and trying to make sense of it. This means that when you change the document, you probably need some kind of helper function that hides the notification.
But for me, the biggest problem with this technique — there are a few other problems too — is that it doesn't compose. Now if I have a complex operation that needs to manipulate multiple properties of the document at once, well, I can do set_foo, set_bar. And you will say: well, no problem. The problem is that the users — meaning the clients of this API — are seeing all the intermediate updates. They may even see broken invariants. Let's say we have the invariant that foo has to be less than bar; we broke that invariant when we did the first set_foo. The UI may break, or it now has to consider the possibility that the invariant is looser. Basically, you have to loosen your invariants in some way. And as your application gets more and more complex, this leads to a lot of complexity, wasted updates, and often callback loops and the spaghetti-code monster that takes over our code.
When you're using immutable data structures, value semantics do compose. First, I would argue that you often don't even need these kinds of accessor methods — here I'm not even using an immutable data structure, just integers — you can just manipulate the value, because you have your own copy inside the function. You took it by value; you can do whatever you want with it, and nobody is going to see it. Then you produce a new value, and that's what other people may see. So from the point of view of the users of this function, this is like an atomic operation — not atomic in the concurrency sense, but atomic in the sense that nobody can observe what happens inside. And you can make other operations that are based on these complex operations, and grow your operations more and more complex, forming a tree where they compose, and nobody has a problem with what's made of what. And because things persist, you always have the previous value if you want it, which means that when you need to figure out what changed, you can just do a comparison: is foo different in the new value than in the previous value? One key thing here is that comparing vectors, maps, etc. is very cheap, because if they didn't change, the internal pointer values of the tree structure are going to be the same; and if they share structure, the equality operator is going to bypass all those parts that are shared.
So this works very fast. And you may argue that it's not so nice to have to do these comparisons manually. There are other things you can do: there's another library, lager, that I've implemented — there are other talks about it — where you have these things called cursors that actually do a lot of this comparing and diffing of the tree for you.

There are other cool things you can do with these data structures. This one is not implemented for vectors, but it is implemented for maps, sets, and tables (which is another data structure we have, based on hash array mapped tries): when you have a map, and you create a new map that has changed a little bit, you can actually use immer to give you the differences between the two values in time proportional to the amount of change — provided they have been derived from one another. So it leverages the structural sharing to implement this very efficiently. It can tell you: this thing was added, this thing was removed, and this thing was changed. This is super useful; I use it all the time, particularly when writing software with UIs that uses immer in some way.
And the other fancy thing you can do with immutable data structures is time travel. What I mean here is: let's say you want to implement an undo history. All editing applications these days have that — even though we're losing it a little bit with mobile — where you can always go back to the previous state of the document after you change something. The canonical way to implement this — it's already in the Gang of Four book — is the command pattern, where the undo history has a collection of objects called commands that implement the change you want to make to the document, and they implement it twice: forward, in a method perhaps called do, and backwards, in a method called undo. And then you have implementations of this for all your operations.

This is nice in the sense that it gets the job done, but it has some disadvantages. One of them is that you have to implement every operation twice, and you have to make sure that undo is the inverse of do, which may not be so trivial in some cases, or may open the door for bugs. Another disadvantage is that you cannot jump between different places: if I want to undo 30 steps, I have to undo them one by one; I cannot just jump to 30 steps ago. Most of the time we just press Ctrl+Z or Ctrl+Y, so you don't experience this, but sometimes it would be nice to be able to just jump in time. Now, if you're using persistent data structures, your undo history can just be a collection of documents. That's it. Because of structural sharing, you're not going to waste a lot of memory by keeping all the copies of what happened before.
Now we get to a very sad moment, because I had a very cool demo prepared here, but my computer didn't like the HDMI cable or something. So I'm just going to describe what was going to happen. I know the demos are the [laughter] best, or the most memorable, parts of my talks. I wanted to say: the proof is in the pudding, right? So I wanted to show you a piece of software that is definitely not trivial, that uses all of what I'm talking about. It's not toy software; it's actually the software we're making at Bronze. At Bronze, we're making a system to make music that is nonlinear and adaptive. This means that whenever you press play, you can get something slightly different. We don't do this with crazy GenAI stuff; we do it by allowing the artist, in a piece of software called the Bronze composer, to create nonlinear musical structures that have space for variation, using probabilities at different levels of abstraction.

That's one side, the Bronze composer software. On the other side, we're making a new audio format and a very portable audio engine that you can integrate in different places. One of the use cases we're really exploring now is video games, which is a very natural use case for this, because you have music that goes on forever and needs to adapt to the environment, and this is all stuff you can achieve with Bronze. In Bronze — I have a screenshot here at least — when you open a Bronze piece, you can see this graph. This graph describes the structure of the music at a high level, because it's composed of sections, where each section has a certain probability of going to another section or not. And these sections can be displayed in the upper part, which you cannot even see in this part of the screenshot, but what I was saying is that in this upper part you can load the sections, visualize them, and play them back. You can even let the engine fill it up for you as it plays the song: it will load sections into this kind of workspace using the probabilities described in the sections, so you can hear the music as the user will hear it, visualize it, and edit it there. And I was going to do this fancy presentation where, using this generative stuff, we could make a song that is hours long, and you can select and move around thousands of regions and clips from one section to another, and you could take the undo history and hover over all the elements, and as you hover, that state is actually presented to you — you can even hear it — and everything responds immediately, zero latency anywhere, everything is super fast. Which is pretty cool. But anyway, you have to take my word for it.
Now, this is really cool, but if we just stayed here, there would be no talk. There are some challenges we need to face. Ominous music. We get to part two of this talk, which is persisting persistence. What I described before, some of you may have seen already — I've presented it in other talks about immutable data structures — but now we're going to face the challenge of persisting persistence. We're using the term persisting here in a cheeky way, with a double meaning: we're talking about persistent data structures, but we're also talking about persisting the data structures to disk.
Now, for educational purposes, we're going to turn our beautiful radix balanced trees into binary trees. You can actually do this yourself with the library, by fixing the B factor, which is accessible as part of the template arguments. This can sometimes be useful to optimize the data structure for your particular use case. We're going to set the B factor to one, which means we have two-to-the-one elements in each node, and this will simplify the display of the things we're going to show.

Now let's say we want to store our undo history to disk. I showed that very simple representation that works really well in memory — you know why: the structural sharing — but what happens when we want to save it to disk? A lot of software these days actually has this feature, where you can load a session and still have the undo history you had before. Another reason you would want to do this is to have infinite undo: if you want infinite undo, at some point you may run out of memory, so you want to continue saving your undo history to disk. Still not technically infinite, but maybe you will die before you can fill up your hard drive with undo steps. [laughter]

All right.
So how will we do this? Let's try it first in the, let's say, intuitive way, the naive way. We have our document, and it's a very simple document here: it's just a struct that contains a vector of strings, and we call this vector of strings data. If we want to save this, we're going to use the cereal library, because it's convenient; I use it in many projects. To serialize a struct with cereal, you just have to define this method called serialize, which can work both ways, for saving and loading, where what you're doing is basically describing how you iterate over your structure. It takes an archive as the first argument; it's templated because the archive can be a different thing depending on whether you're doing JSON or XML or binary. And you take your value and call the archive with your data. In this case we use this NVP thing — NVP means named value pair — so we give it a name, and if you're using a format like JSON or XML, where you actually can see names in the output, this is the string that it's going to use for that.
Now we also need to save the immer vectors. There is support for std::array and std::vector and all that stuff in cereal, but there is no support for immer, so we can add it ourselves. To do that, for loading, we tell it to serialize the size using this special size tag — this is what makes us get an actual array in JSON, for example, as opposed to an object. So we load the size to know how many elements there are, then we iterate over every element: we load the element and push it into the immer vector. There's something interesting happening here, which is that I'm moving the vector before calling push_back on it. This will make the function work way faster than without the move, because here we don't care about persisting the intermediate values of m. [clears throat] If we move it, immer knows that these intermediate values are not visible to anyone, and it will actually do some of the mutations in place, and this will be way faster. In fact, filling up immer vectors using this pattern is similarly efficient to filling up std::vectors, with the difference that here you don't sometimes have a big reallocation step; it's a bit more constant. There was a question there.
Where's the index i? The index i is the number of the element that I'm loading. So I load the size, so I know how many items this array, sorry, this vector has, and then I iterate that number of times.
Yeah.
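The loading pattern just described, moving the container into push_back so the intermediate values can be mutated in place, can be sketched in standard C++. This is not immer: the cow_vector below is a hypothetical stand-in that only models the "sole owner may mutate in place" rule, whereas immer's real structure is a tree with finer-grained sharing.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

template <typename T>
class cow_vector {
    std::shared_ptr<std::vector<T>> data_ = std::make_shared<std::vector<T>>();

public:
    // push_back on an rvalue: if this value is the sole owner of its storage,
    // nobody can observe the intermediate state, so mutate in place.
    cow_vector push_back(T value) && {
        if (data_.use_count() == 1) {
            data_->push_back(std::move(value));
            return std::move(*this);
        }
        return static_cast<const cow_vector&>(*this).push_back(std::move(value));
    }

    // push_back on an lvalue: copy the storage so the old value persists.
    cow_vector push_back(T value) const& {
        auto copy = std::make_shared<std::vector<T>>(*data_);
        copy->push_back(std::move(value));
        cow_vector result;
        result.data_ = std::move(copy);
        return result;
    }

    std::size_t size() const { return data_->size(); }
    const T& operator[](std::size_t i) const { return (*data_)[i]; }
};

// The loading loop from the talk: moving the accumulator into push_back lets
// the container reuse its storage, because nobody else can see it.
inline cow_vector<std::string> load_strings(const std::vector<std::string>& in) {
    auto out = cow_vector<std::string>{};
    for (const auto& s : in)
        out = std::move(out).push_back(s);  // the crucial std::move
    return out;
}
```

Dropping the std::move in load_strings would force a full copy on every iteration, which is exactly the cost the talk warns about.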
Yeah. When saving, I need to first save that size, symmetrically, with the same size-tag thingy, and then I use immer's for_each and I just give the archive to it. This is also something noteworthy: why is there an immer for_each when we already have std::for_each in the standard? Because immer's for_each uses internal iteration, I think that's the technical term. It goes inside the data structure; it doesn't use iterators, so it's more efficient, because it traverses the tree directly. If you use iterators, since it's a tree, you need to store the stack of the traversal inside the iterator. It's not much slower; often I just use a for loop, just because I'm lazy, or because I don't want the object-size increase that this may introduce. But if you want to be super fast, use immer's for_each.
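The internal-iteration idea can be illustrated with a toy chunked container, loosely shaped like the leaves of immer's tree (none of these names come from immer itself): the container's own for_each walks its chunks directly, so no iterator object has to carry traversal state.

```cpp
#include <vector>

struct chunked_ints {
    std::vector<std::vector<int>> chunks;  // each inner vector is one "leaf"

    // Internal iteration: the container walks its own chunks, so callers
    // never need an iterator that stores where it is in the traversal.
    template <typename Fn>
    void for_each(Fn&& fn) const {
        for (const auto& chunk : chunks)
            for (int v : chunk)
                fn(v);
    }
};

inline long sum(const chunked_ints& c) {
    long total = 0;
    c.for_each([&](int v) { total += v; });
    return total;
}

inline long demo_sum() {
    chunked_ints c;
    c.chunks = {{1, 2}, {3, 4, 5}};
    return sum(c);
}
```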
So now, for illustration purposes, we're going to create three immer vectors of strings. That's the data that we had in the documents. If we go back, the document was a vector of strings, or contains one. So we're going to create these three vectors. The first one contains four very simple strings. Then we add two new strings to the vector. And then we have a third state that is actually the same. Right?
Now, the history is going to be a vector of documents. So we put all of this into a vector and we serialize it the cereal way, which is: you create the archive and you call the archive with your value, and we get this. This value0 thing is an artifact of cereal, because it doesn't know the name of the root object that you're giving it. There is a way to get rid of it; we don't need to go through that.
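A std-only sketch of these three states (no immer here: a shared_ptr to an immutable vector stands in for a persistent value, and it can only show whole-value sharing, not immer's subtree-level sharing):

```cpp
#include <memory>
#include <string>
#include <vector>

// A "document state" is an immutable vector of strings behind a shared_ptr.
using doc = std::shared_ptr<const std::vector<std::string>>;

inline doc make_doc(std::vector<std::string> strings) {
    return std::make_shared<const std::vector<std::string>>(std::move(strings));
}

inline std::vector<doc> make_history() {
    auto v1 = make_doc({"A", "B", "C", "D"});            // first state
    auto v2 = make_doc({"A", "B", "C", "D", "E", "F"});  // two strings added
    auto v3 = v2;  // third state is identical: no new allocation at all
    return {v1, v2, v3};
}

// v2 and v3 point at the very same storage in memory; a serializer that
// writes each state independently throws this sharing away.
inline bool shares_storage() {
    auto h = make_history();
    return h[1].get() == h[2].get();
}
```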
This works. It's nice. But there is a problem with this: these values are independent. Let's think about it again. If we represent the three values that we had, they can be represented very compactly with this tree. We had v1, which has two leaf nodes, because it's a binary tree with two elements each. Then we had v2, which had two extra elements, so it adds a root and those things to the leaves. It doesn't waste any space; it only allocates the new data that is necessary. And we had v3, which is actually identical to v1, sorry, to v2.
Instead, when we save, we get three completely independent vectors, because cereal doesn't know about the internal structure of immer. And this means that when we load, we're going to get three different trees. We lost this beautiful structural sharing, and if we did this with a very long undo history, it's potentially going to blow up our memory. This is not fun. Actually, the file itself will already be potentially really big, because you will have all these arrays with duplicate information, where "A B C D" is listed four, sorry, three times in the JSON file, when in memory each letter appears only once. So this is not nice. Can we do better?
Well, the way we're going to do better is by introducing the notion of a pool, because the problem we have is that each vector is treated completely independently. cereal doesn't know about the fact that some vectors share structure with others. So we are going to introduce this notion of a pool of vectors.
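The essence of a pool can be sketched as an interning table: store each distinct node once and refer to it by a small integer ID. This is a hypothetical std-only model (a shared_ptr stands in for an immer node, and pointer identity is the interning key), not immer persist's actual representation.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

using node = std::shared_ptr<const std::vector<std::string>>;

// A pool interns each distinct node once; values refer to nodes by ID, so
// sharing survives a save/load round trip.
struct pool {
    std::map<const void*, int> ids;  // node identity -> ID
    std::vector<node> nodes;         // ID -> payload, stored only once

    int intern(const node& n) {
        auto [it, inserted] = ids.try_emplace(n.get(), (int)nodes.size());
        if (inserted)
            nodes.push_back(n);
        return it->second;
    }
};

// Interning three documents where doc 2 and doc 3 are the same value yields
// the IDs {0, 1, 1} and only two stored nodes.
inline bool demo_pool_ids_ok() {
    auto v1 = std::make_shared<const std::vector<std::string>>(
        std::vector<std::string>{"A", "B", "C", "D"});
    auto v2 = std::make_shared<const std::vector<std::string>>(
        std::vector<std::string>{"A", "B", "C", "D", "E", "F"});
    auto v3 = v2;
    pool p;
    std::vector<int> ids = {p.intern(v1), p.intern(v2), p.intern(v3)};
    return ids == std::vector<int>{0, 1, 1} && p.nodes.size() == 2;
}
```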
Now, if we have a pool of vectors, maybe we can teach cereal to look into what sharing happens between the vectors within the pool, and to actually serialize the internal structure of this pool in some kind of JSON representation. Then we can load the pool as a whole, and out of this pool we can reconstruct the vectors with the shared structure that we wanted. This is pretty nice, and it's what the new library that I'm presenting today does. It's called immer persist, and it's in the immer distribution, on master. It's a little bit experimental, so I will mention some of the caveats at the end, but it's usable, it works, and all the code that I'm presenting today is actually runnable code.
So how can we use immer persist for our documents? The first thing that we're going to do is get rid of that serialize function, and we're going to replace it with BOOST_HANA_ADAPT_STRUCT. Now, if you're using C++26 reflection, of course, this is not needed. If you're on C++20, you can even use Boost.PFR to avoid it. In some cases, I'm a little bit old school, so I use Boost.Hana. Maybe I could even use BOOST_FUSION_ADAPT_STRUCT, actually. Michael and me were mentioning yesterday that we miss the old-school template metaprogramming stuff. This morning, actually, I found out a very good explanation for why that's happening in our brains. If you're curious, ask about it in the questions; it's maybe a little bit too much.
What BOOST_HANA_ADAPT_STRUCT does is inform Boost.Hana, which is a metaprogramming library that gives you compile-time data structures, about the struct. So it tells it which are the members and all of this stuff. Then we can get this information at compile time and implement the serialize function in a generic way. The serialize now doesn't have to be customized for a document. We can say: for every Boost.Hana struct, to serialize it, you iterate over the members using hana::for_each, you get the keys, which are kind of the identifiers of the members of the struct, and then you use this cereal NVP, getting what's on that key.
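The generic-serialize idea can be sketched without Boost.Hana: if each struct can expose its members as a compile-time list of name/pointer pairs, which is roughly the information the adapt-struct macro generates, then one function serializes every such struct. Everything below, including the members() convention and the text output format, is invented for illustration.

```cpp
#include <sstream>
#include <string>
#include <tuple>
#include <utility>

struct document {
    std::string title;
    int version;

    // Hand-written "reflection": each member as a (name, pointer) pair.
    auto members() const {
        return std::make_tuple(std::pair{"title", &title},
                               std::pair{"version", &version});
    }
};

// One serialize for every such struct: walk the member list at compile time
// and emit name/value pairs (a stand-in for calling the archive with an NVP).
template <typename T>
std::string serialize(const T& value) {
    std::ostringstream out;
    std::apply(
        [&](const auto&... member) {
            ((out << member.first << "=" << *member.second << ";"), ...);
        },
        value.members());
    return out.str();
}
```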
[cough and clears throat] Now, once we have this, we can go back to our previous example. So we have the vector of strings, we push those two new elements, and we have the third copy. The data is exactly the same, and we have the history, which is exactly the same. But to save it, we're going to use cereal_save_with_pools, which is a function from immer persist that leverages cereal to serialize all the data structures that appear in that document tree using pools. What's cool about it is that you don't even need to specify what the pools are. It will figure out what the pools are; it will combine all the vectors that have the same type. And then there is this policy thing, the Boost.Hana auto-member-name policy, which means: use Boost.Hana to figure out precisely which pools there are in the thing, give them some names automatically, and just get it working. So if we do this and serialize, we now get this beautiful, maybe not so beautiful, JSON. It's definitely less human-readable, but what's nice about it is that we do have the structural sharing in it. So if we look at the value0 thing: value0 is the actual history, which is an array of three documents, and what you can see is that the vectors contained in the documents are not referenced directly. We don't see an array expanded there; it just gives each one an ID, and you can already notice that vector 2 and vector 3, which are the same vector, have the same ID. Right?
Then we have the pools, and the pools contain all the pools that are needed for our document. Now, since in our document we only had one data structure, which was called data, it already gave it the name data. This is sometimes convenient. Sometimes it's a little bit confusing, because if you actually have vectors of the same type in different parts of your document, it picks up the name of the first attribute that uses this vector as the name. But you can customize this; don't get too hung up on it. It actually serializes the template parameters too, because a vector with B1 is not the same as a vector with B2, and if you mess this up, the structure will be different. You could mess up loading. So this allows us to check, when we're loading: are we in the right format?
Now we have the leaf nodes, and you can see that in the leaf nodes there's no duplication of information; we only have the three leaf nodes that we saw in our picture before. Then the vectors: there are only two vectors, because v2 and v3 are the same, so there are only two. And then there are two inner nodes. Now, this is actually slightly different from what there was in the picture, because there is this tail thing here. Watch the other immutable-structures talk if you want to know more about it, or I can explain in the questions if we have time. It's an optimization that the data structures have; it changes the shape of the tree a little bit, but for the most part it's the same as we have in the picture. And this is nice: we now have structural sharing on disk.
Here is a little overview of what happens in this policy thing. The policy is what does some of the magic that allowed it to be very automatic, but you can also do it manually; depending on your application, you may want to do some of this. First, you say what the pool types are. This is like a Boost.Hana collection, a compile-time thing, where you say: well, I want a pool for the vectors of strings, I want a pool for this and that. And maybe you want to skip some data structures, because, given the way they're used in your application, their structural sharing is not relevant; you can skip them there. Then you can give them names, and this is the name that you're going to see in the JSON. We saw we called them data there; we could call them something else. And then you can also customize the save and load for the root value, so you don't get that value0 thing; you can give it a different name, like, for example, history. Okay, really nice.
We managed to get persisting to disk with this pool concept. Can we do something else that is kind of cool?
Well, saving to disk is a transformation of the pool, right? You have a memory representation of the pool, and we transform it into a JSON representation. We could do any arbitrary transformation on these pools, and this is actually useful; it solves this problem. Let's say you want to transform the strings of all these vectors that you have. So you can have a transform function like this that uses std::transform and uses this trick of getting a transient. The transient just gives you a mutable API instead of an immutable API, one that is compatible with the standard library, push_back and stuff like that. So [snorts] we transform the vector and we get new vectors, using a function. For example, we convert those three vectors that we had to uppercase.
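The transient trick can be sketched in plain C++ (a shared_ptr-wrapped vector is a stand-in for an immer value; a real immer transient mutates the tree in place when it owns it, which this model doesn't capture):

```cpp
#include <algorithm>
#include <cctype>
#include <memory>
#include <string>
#include <vector>

using text_vec = std::shared_ptr<const std::vector<std::string>>;

// Thaw into a mutable working copy (the "transient"), run a standard
// algorithm on it, then freeze it back into an immutable value.
inline text_vec to_upper(const text_vec& v) {
    auto transient = std::vector<std::string>(*v);  // mutable, std-compatible
    std::transform(transient.begin(), transient.end(), transient.begin(),
                   [](std::string s) {
                       for (auto& c : s)
                           c = (char)std::toupper((unsigned char)c);
                       return s;
                   });
    // Freeze: the result is immutable again.
    return std::make_shared<const std::vector<std::string>>(std::move(transient));
}

inline bool demo_upper() {
    auto v = std::make_shared<const std::vector<std::string>>(
        std::vector<std::string>{"abc", "def"});
    auto u = to_upper(v);
    // The new value is uppercase; the original value is untouched.
    return (*u)[0] == "ABC" && (*u)[1] == "DEF" && (*v)[0] == "abc";
}
```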
Now, if we do this, we have the same problem that we had with serialization. The transform function doesn't see the internal structure of the vectors. It is going to iterate over every element of every vector and, very dumb and straightforward, go and transform every item, and these things, which before were interdependent in their representation, suddenly become independent. And we waste memory. We also waste work, because this transform function is transforming elements that it could have transformed only once.
But using pools, we can apply a transformation over the pool and get a transformed pool where the change has been applied only to the elements themselves. There is a very important use case for that in Bronze, which is actually what I'm going to show here: how we deal with versioning of documents.
There is no perfect way to deal with versioning, and all serialization libraries try something different. Very often you end up, in your serialization code, reading a version string and doing one thing or another depending on which version it was. I don't like that, because it's kind of error-prone: if you mess up how you're interpreting the old version, you may not notice. What we do in Bronze is that, for every version of the document, we just have C++ types that describe the schema in C++, because we're using cereal or Boost.Hana for serializing them. And if we introduce a new version of the document, for example, instead of a vector of strings it's going to be a vector of chars, we just copy-paste the things that change, we make the change, and then we have code that leverages Boost.Hana to do most of the conversion automatically. So it traverses the tree automatically, and only for the things that actually changed do you need to write your little conversion function that upgrades. And if you forget something, you're going to get a compilation error. So there's kind of no way to mess up the versioning, unless you modified an old version header, which you should never do. In a code review, it's very easy to see: hey, you modified the V1 header, why did you do that? No, you have to create a V2 header, copy, paste, and do the change. So I like this, but it has that problem: we [clears throat] are transforming between versions of the document, and we're kind of destroying the structural sharing if we use the naive approach.
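The version-per-type idea can be sketched like this. The v1/v2 namespaces and field names are illustrative, not Bronze's actual schema; the first-character conversion rule is the toy transformation described in the talk.

```cpp
#include <string>
#include <vector>

// Each document version is its own C++ type; old version headers are frozen.
namespace v1 {
struct document {
    std::vector<std::string> data;
};
}  // namespace v1

namespace v2 {
struct document {
    std::vector<char> data;  // the one deliberate schema change
};
}  // namespace v2

// Upgrading is an explicit function between the two types, so a forgotten
// field or a wrong type shows up as a compilation error, not silent drift.
inline v2::document upgrade(const v1::document& old) {
    v2::document doc;
    doc.data.reserve(old.data.size());
    for (const auto& s : old.data)
        doc.data.push_back(s.empty() ? '\0' : s.front());  // the talk's toy rule
    return doc;
}

inline bool demo_upgrade() {
    auto doc = upgrade(v1::document{{"Apple", "Banana", ""}});
    return doc.data == std::vector<char>{'A', 'B', '\0'};
}
```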
So how do we achieve this? To achieve it, we're going to write a transformation that describes the changes that we're going to apply to all the pools of the document.
So, for this, we have a Boost.Hana map that contains entries for each type of data structure that I have in the document. Here, for the vector of strings, I'm going to convert the values. That's the key thing here: we're doing a transformation on the values of the vector, and that's why we can preserve the structural sharing. If you were doing something else, like a filter, you cannot preserve the structure, because the structure will be different. But if you're just transforming the values, you can preserve the structure, and you describe the transformation in this way. Here we're doing a very stupid transformation that doesn't really make sense, which is: we're just getting the first character of the string. We said, oh, we were using A, B, C, D; why not just represent each string with its first character? If there was no character, for safety, we just use a null character. A very stupid transformation, but it highlights what we want to do. So yeah, that's the type of the vector, and this is the transformation.
You could have many of these in a practical application, where you have a complex document with lots of different pools. Now we have the same example again, with our three vectors and our history that contains the three vectors. How do we transform this history?
Well, first we're going to say: give me output pools. We distinguish between input and output pools. Output pools are pools that you can get values from, and we need to get the values out to transform them, so this is an output pool. Now we say: transform the pool using the xform, the transformation that I just gave you. With this, we get another pool, P1. And now we can transform the history itself, because for the history there's no structural sharing; there is only one history. We're going to transform it using, in this case, standard ranges, just to be a little bit fancy. What we have as an argument is a version-one document, and we want to return a version-two document, and we do that by converting the container using the two pools from before. You actually need to pass the two pools because the whole thing happens lazily: when you ask the transformed pool for a new pool, it's going to look up in the previous pool whether it has been transformed or not. You pass in the container that you want to transform, and then you get the transformed container. So data here, in the argument, is the vector of strings; the data that you have in the variable is the vector of chars that you can return, and finally we get the new history, updated.
And if I serialize it to disk, just to show the internal structure, you can see that it's basically identical to the other... Oh, sorry. [laughter] I was waiting for one. You can see that it's identical to the previous document, or the previous history. The only thing that's different is that the leaf nodes don't contain strings anymore. Now they contain chars, and the chars are serialized as the code of the character.
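The effect of transforming "over the pool" can be sketched with a memoized conversion keyed on node identity: each distinct node is converted once, and nodes that were shared before stay shared afterwards. This is a std-only model, not immer persist's API.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

using str_node = std::shared_ptr<const std::vector<std::string>>;
using chr_node = std::shared_ptr<const std::vector<char>>;

// Convert a history of string documents to char documents, transforming each
// distinct node exactly once (keyed by its identity) so sharing is preserved.
inline std::vector<chr_node> transform_history(const std::vector<str_node>& history) {
    std::map<const void*, chr_node> memo;  // old node identity -> new node
    std::vector<chr_node> out;
    for (const auto& node : history) {
        auto& slot = memo[node.get()];
        if (!slot) {  // first time we see this node: do the work
            std::vector<char> converted;
            for (const auto& s : *node)
                converted.push_back(s.empty() ? '\0' : s.front());
            slot = std::make_shared<const std::vector<char>>(std::move(converted));
        }
        out.push_back(slot);  // shared nodes stay shared
    }
    return out;
}

inline bool demo_pool_transform() {
    auto v1 = std::make_shared<const std::vector<std::string>>(
        std::vector<std::string>{"Apple", "Banana"});
    auto v2 = std::make_shared<const std::vector<std::string>>(
        std::vector<std::string>{"Apple", "Banana", "Cherry"});
    auto out = transform_history({v1, v2, v2});  // v2 appears twice
    return (*out[0])[0] == 'A' && out[1].get() == out[2].get();
}
```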
So yeah, this is pretty cool, I think, and it gives us lots of things. There are more things I could say about the library; I'm just going to go through them very briefly. I showed how to transform vectors. The hash tables and sets also work. You have to be a little bit careful, though, because if you're saving to disk, you need to make sure that the hashing function you're using is portable. You cannot just use std::hash and get a different result on Windows or Mac, or even upgrade the compiler and maybe get a different result, and then your documents will stop loading. The hashing function has to be stable. So we provide a wrapper for xxHash, which we found to be kind of nice. You could use something else.
You can also convert the key types. So if you have a map, you can actually convert the key type, as long as the hash is the same, or as long as the keys are equivalent, actually. This is useful to have. We have it in our document, because often we have type-safe keys for tracks and regions and sections, and when we change versions, the key type actually changes, but the information underneath is equivalent. So it's an important feature to have, but you have to use it a bit carefully. Anyway, when you try to convert something, by default it's not going to let you, and then you have to tell it: please, I really want to convert this key type; it's safe. Also noteworthy: complex data models work.
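The portability requirement on the hash mentioned above can be illustrated with FNV-1a, a hash whose output depends only on the input bytes, never on the platform or standard library. The talk's library wraps xxHash instead; FNV-1a just keeps the example tiny.

```cpp
#include <cstdint>
#include <string_view>

// FNV-1a, 64-bit: fully specified arithmetic, so values hashed with it land
// in the same buckets on every platform and compiler.
constexpr std::uint64_t fnv1a(std::string_view s) {
    std::uint64_t hash = 14695981039346656037ull;  // FNV offset basis
    for (unsigned char c : s) {
        hash ^= c;
        hash *= 1099511628211ull;  // FNV prime
    }
    return hash;
}

// std::hash gives no such guarantee: its result may differ across platforms
// and even across standard-library versions, breaking on-disk hash tables.
static_assert(fnv1a("") == 14695981039346656037ull);
```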
As I mentioned, if you have lots of data structures in your tree, all these functions are going to collect all the data structures and compile them into pools, per type, which is really nice. You can have as much sharing in the structure as you want, in different parts of the tree. And one key thing that was actually very important for us is that recursive data models work. This was actually quite tricky to get working, because you can imagine: when you're loading something, and it's recursive, and the information is in the pool, it's not trivial at all. But this was important for us, because we're working on a new feature where you were seeing, like, tracks with clips. We have this feature where you can actually have an arbitrary timeline in one of these clips that expands to other tracks with clips, and you can have more tracks with clips in any of those, and this can grow recursively, which is very expressive and pretty nice. It was a little bit of a headache for this feature, but it's actually the original motivation we had to implement immer persist, because we were not supporting persistent undo history but we wanted to have that feature, and with that feature it's very easy to create a very complex clip in the timeline, and then you just duplicate it around to create loops and move it around, and for all these duplicates you want to preserve the structure. So this was very important for us.
Finally, as I mentioned, the library works, but there are definitely some to-do items. The first one is incremental processing. At the moment, we save the collection, or the pool, in one step. For the undo-history case, for example, it would be really nice to be able to save just the differences that have been introduced in the pool by the new document, so you save the undo steps as they come in, incrementally. We have some ideas on how to implement this, but we haven't done it yet. Also, I think binary formats are not working yet. It would be nice to get them working for some use cases where you want a more compact representation, which is also something we can work on. The JSON that you saw before was quite wordy, right? It was: oh, you're sharing structure, but you actually have way more characters. We can actually make it more compact, but I thought, for the slides, it was nice to have some of the things spelled out. Then, cereal independence: right now we depend on the cereal library. I think we don't fundamentally need to. It would be really nice to have some kind of hook where you can use your own serialization library. And build time and object size: when you have lots and lots of pools, particularly in our case, where for every version of the document we have our own set of pools, build time and object size get pretty big. I think there are definitely ways we can improve on this, because part of the problem is that right now it's all templated, so for every pool you get new serialization code for everything. But, for example, the internal nodes of the data structure are the same regardless of what the type of your data structure is, so all of this could be kind of type-erased, and using other techniques I think we can definitely improve both build time and object size.
Also, it's very important to thank Alex Shabing. He actually did basically 99% of the implementation of what I presented today. Sadly, he couldn't come to CppCon, so I showed it myself. But yeah, he's an excellent developer; he has been working with Bronze, and he's also an independent contractor. If you want to hire him for anything, I can definitely vouch for him. He's a really brilliant developer.
Now, key takeaways. Structural sharing is a very important property of immutable data structures, and they enable completely new architectures in our software. With immer persist you can leverage structural sharing across sessions of your application by storing it to disk, and you can transform the structures while preserving the structural sharing. I think this has value in lots of applications. I think it was a key thing that we were missing to really be able to use these value semantics at scale, at really big scale. Also, I will say: help us. Bronze is a very small startup, which sponsored most of the development of this persist stuff, but if you find it useful for your work, please hire Alex or any of us to continue improving on it. It's an ongoing effort. There are other companies that have contributed to immer in different ways, and yeah, it's beautiful that with open source we can all contribute a little bit, either financially or with code. If you want to try implementing some of the things that I've described, or other things that you need for your own use case, please be my guest. Here is a list of previous talks on this topic. I'm not going to go through them now, because we're running out of time; I'm going to leave it up, since some people were taking pictures of it. Thank you all for listening, and if you have any questions, it's the time to ask. Thank you.
[applause]
Yeah. Hello. Very nice talk. Thank you.
I have two questions. First, you mentioned that the pools are detected automatically.
What is detected automatically? The pools?
The pools, I mean. What does that involve? I mean, is it a costly operation to detect the pools?
Well, it happens at compile time. That's why it uses Boost.Hana, and since we had Boost.Hana, we could get rid of the cereal code. We use Boost.Hana to traverse your document, figure out all the types, and combine them into these pool types. So there's no runtime payment here. There's some payment in compile times, which, depending on your document, may be relevant or not.
But let's say you have three immer vectors, like in your example. The library has to detect that they share information.
Oh, you mean that? Right. That happens as you serialize.
Uh-huh. So there's bookkeeping in the serialization?
Yeah, that's part of the serialization. Well, it's part of creating the pool that is then serialized. [clears throat]
It's not part of cereal?
It's not part of cereal, right, exactly. It's part of immer, or of immer persist in particular. And yeah, as it's trying to serialize things, it finds the commonalities, etc. But the cost is proportional to the amount of stuff that you're actually writing to disk, so proportional to the amount of change that you have, to the size of the actual final pool. So there's no, like, secret quadratic crazy thing in there.
Second short question: these immer vectors support iterators?
Yes, normal iterators; you can iterate over them normally. I showed immer's for_each, which can give you a bit of an extra performance boost where you need it, but you can also use normal iterators, and I use them all the time.
Is there any std algorithm that you cannot use at all, because it's so expensive, or do all of them work?
There are algorithms that you cannot use at all, because they're not even compatible, and there are ones that I don't recommend applying directly. One key example is sorting. When you sort a vector, you need to go through the whole vector anyway, and you're producing a completely independent one, and you need to do a lot of random access. If you do this on an immer vector, you're going to be paying the price directly. So what I do in practice is: I just copy to a std::vector, sort, and then convert back to an immer vector.
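That copy, sort, rebuild workaround can be sketched like this (std-only; the shared_ptr-wrapped vector is a stand-in for an immer vector):

```cpp
#include <algorithm>
#include <memory>
#include <vector>

using ivec = std::shared_ptr<const std::vector<int>>;

// Sorting needs heavy random access and produces an entirely new value
// anyway, so: copy out, sort in a plain std::vector, freeze the result.
inline ivec sorted(const ivec& v) {
    auto tmp = std::vector<int>(*v);    // linear copy out
    std::sort(tmp.begin(), tmp.end());  // sort with cheap random access
    return std::make_shared<const std::vector<int>>(std::move(tmp));
}

inline bool demo_sorted() {
    auto v = std::make_shared<const std::vector<int>>(std::vector<int>{3, 1, 2});
    auto s = sorted(v);
    return *s == std::vector<int>{1, 2, 3} && (*v)[0] == 3;  // original intact
}
```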
So the category that you gave to the iterator is not random access? You put something like forward?
It's random access, yeah. But it's not mutable, so that's why you're not going to sort with it.
Oh, that's what prevents the sorting. Okay. Thank you.
Yeah.
Hi, thank you so much for the talk, and more than that: in our team, we are actually using immer extensively.
Oh!
It's a building block in a very large-scale application. I've got a question. Say we use immer in a composite way, like immer of immer as a value type, all the way down, and somehow the value type has to be a pointer. Then what would be your recommendation? How are we going to deal with that?
You mean in the context of immer persist?
Yeah, in persist, and in the transition, if the value type is a pointer. When we say "get the value", if we want to get a value and try to make changes, we want to make a true value copy instead of a shallow copy with a pointer.
I'm not sure I'm fully understanding. If you're using immer, you're going to have a vector of vectors, you're saying, for example, right? From the point of view of your code, it's a vector of vectors; there's no pointer; it's the pools that are going to actually...
Yeah, the value itself. It's not like a vector of vectors, but a vector of pointers to a vector, something like that.
Okay, I see, and you want to know how to translate these pointers that you're using?
Yeah. So once we get, say, a transient, at that moment we actually want to enforce a true value copy.
I see. I guess, yeah, if you're using this pool thing, you will need to translate those pointers, which is what the thing is doing under the hood: it's translating the pointers to those numbers and then getting them back. So I imagine there will probably be a way to do this. I don't have it in the top of my head, and I'm not sure you can actually get it out of the API that we have now. But I think conceptually it's something that could be added, because that's part of what the thing is doing: it's translating.
Yeah. Right now we have our own way, trying to overload things to get away with it, but I just wanted to see if you have any recommendations on what would be best.
I'm not sure I'm fully understanding the problem, actually. Maybe you can show me with code afterwards, and we can see if there is something better.
Okay, cool. Thank you.
Thanks.
Can I ask the last question?
Yes, please.
Um, so I'm very interested in iterator categories in the context of the STL. So if you had all the power in the world to, you know, adapt the STL to your needs, do you think this involves, like, a new kind of intermediate category for iterators that you could exploit? Like, for example, deque uses segmented iterators, you know, in a hacky way.
Um, maybe, yeah, I haven't really thought of that. That could be a thing. I must admit that I haven't really prioritized STL compatibility. I was actually considering that we could have a v2 of immer that didn't use... sorry, that didn't use immutable APIs by default, because in the end you can just have a mutable API, which is what the transient does, and use copy-on-write. That's actually the interesting thing: the actual structure that is underneath. You can use copy-on-write and get exactly the same benefits. Yes, I'm using immutable APIs just because I was coming from the Clojure world, and it kind of encourages a way to think about things that is a bit more value-oriented.
Um, so there are things that we could do in immer to bring you closer to the STL stuff.
Uh, I could imagine that there are still things that you would need in the STL. Like, for example, if you're changing the vectors, okay, we're doing copy-on-write, so you're not getting true references to things. It's a little bit like the vector&lt;bool&gt; case, right? So maybe we would need to tweak some of the actual iterator categories, and I'm not enough of an STL lawyer to be able to tell you off the top of my head which changes you would need.
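The vector&lt;bool&gt; comparison can be made concrete with a few lines of standard C++: `operator[]` on `std::vector<bool>` returns a proxy object rather than a `bool&`, which is exactly the wrinkle that generic code and iterator categories have to account for when true references aren't available. The function name below is illustrative.

```cpp
#include <type_traits>
#include <vector>

// Demonstrates that vector<bool>::operator[] yields a proxy,
// not a real reference, unlike vector<int>.
inline bool proxy_writes_through() {
    std::vector<int>  vi{1, 2};
    std::vector<bool> vb{true, false};

    // For most vectors, operator[] yields a true reference...
    static_assert(std::is_same_v<decltype(vi[0]), int&>);
    // ...but vector<bool> hands back a proxy object instead.
    static_assert(!std::is_same_v<decltype(vb[0]), bool&>);

    auto proxy = vb[0];  // copies the proxy, not the bit
    proxy = false;       // still writes through to the vector
    return vb[0] == false;
}
```

A copy-on-write container faces the same tension: handing out plain references would bypass the detach-on-write step, so element access has to go through proxies or const references.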
Um, but yeah, if someone wanted to have that, that would be an interesting problem to discuss. I mean, sometimes people ask me, like, maybe this would be something nice to have in the standard, right, this type of container, because I use it in all my projects now. But I find the idea of making it make sense with the rest of the standard too daunting, so I haven't even tried. But if someone else wants to try, they can maybe do this, right? But yeah, definitely, what you're saying is what you would need to figure out as part of that work: how it makes sense with the rest of the STL and the iterator categories.
Thank you. Thank you so much.
Oh, thank you.
All right. So if there is no other question, thank you very much, and yeah, see you. [applause]