CRDTs: The Hard Parts
By Martin Kleppmann
Summary
Topics Covered
- CRDTs Enable P2P Collaboration Without Central Servers
- Interleaving Destroys Concurrent Text Insertions
- List Moves Require Last-Writer-Wins Positions
- Tree Moves Use Undo-Redo to Prevent Cycles
- Binary Format Cuts CRDT Metadata 200x
Full Transcript
Hello everybody, my name is Martin Kleppmann. I'm a researcher at the University of Cambridge, and today I'd like to talk about some of our work on conflict-free replicated data types, or CRDTs. I'm going to start this talk with a very brief introduction as to what CRDTs are and why they are useful, but most of this talk is going to be about new results that we have figured out only within the last year or two. You might know me from this book here, which I wrote a couple of years ago. It's called Designing Data-Intensive Applications, and it's a broad overview of the architecture of data systems and how to choose your database for your particular application. That's not what we're talking about today. Today we're talking about collaboration software, and I'm taking a fairly broad view of this: it's any kind of software where several users can come in and update a document or database or some kind of shared state. Various examples of software fit in this category. Google Docs is an obvious example, where you can have several people editing a text document or spreadsheet at the same time; you get similar things with Office 365. Figma is an example of graphics software that supports multi-user real-time collaboration. Trello is an example of project management software where several users can update the status of various tasks, assign the tasks, add comments, and so on. All of these have in common that the users are updating some kind of shared state. Now I'm going to use this example of
collaborative text editing for now; there will be some examples from other types of applications later as well. In collaborative text editing you can end up with a scenario something like this: you have two users who, at the beginning of the time we're looking at, have the same document, which is "Hello!". Then the red user comes along and changes this document to "Hello World!" by inserting the word "World" before the exclamation mark, and concurrently, while the red user is doing this, the blue user changes the document to add a smiley face after the exclamation mark. These two changes happen concurrently, without knowledge of each other, and then later on the network communicates those changes: these users communicate over the Internet, or whatever network they are using, and they should converge to the same state. We want that, at the end, both users have the same document on their screen, and that document reflects all of the changes that were made. In this case the merged outcome is quite reasonable: we get the word "World" before the exclamation mark and the smiley face after the
exclamation mark. Okay, so there are broadly two families of algorithms for achieving this kind of collaborative text editing. One is operational transformation, or OT, which is kind of the classic of multi-user collaborative editing; it has been around for a long time and is used, for example, by Google Docs. Then there's a newer family of algorithms called CRDTs, which is what I work on, which take a bit of a different approach to solving a similar problem, and we will compare the two in a minute. I will start with a brief summary of how operational transformation works, because that is then useful as a comparison point for CRDTs. With operational transformation we have to first of all describe how changes are made to a document. When a user changes a document, they insert or delete a character at some position in the document, and we can describe the positions at which those insertions and deletions happen by using indexes: we simply number the first character 0, the second character 1, and so on. Now you can have two users updating their document: here on the left-hand side the user changes the document "HELO" by adding a second L at position 3, and on the right-hand side we have the other user adding an exclamation mark after the letter O. After these two changes have happened, they need to be communicated over the network so that one user finds out about the changes made by the other user. From the left-hand side here we have the insertion of the letter L at index 3 in this document, and from the right-hand side we have the insertion of the exclamation mark at index 4. So let's first of all take the insertion
from the left-hand side and move it over to the right-hand side. If there's a server, for example, that forwards these messages from one user to the other, the server can just forward this "insert L at position 3" operation over to the right-hand side, and on the right-hand side we insert L at position 3 and we end up with "HELLO!", with two Ls and an exclamation mark, just as we wanted. Now what happens if we go in the other direction, taking the exclamation mark insertion from the right-hand side to the left-hand side? Note that if we simply forward this operation, "insert exclamation mark at position 4", unchanged, then the user on the left-hand side would end up with the document saying "HELL!O", which is not what we want. We want both users to end up in the same state, and we want that state to reflect the intention of what the users were writing. So instead, what has to happen is that, instead of inserting at position 4, the exclamation mark needs to be inserted at position 5. The reason is that, concurrently, there has been another insertion at position 3, before the position of this insertion, and so the insertion at position 4 needs to be transformed into an insertion at position 5, due to the concurrent insertion of the L earlier on. This is why the algorithm is called operational transformation: we have to take these insertion and deletion operations and, depending on which other operations happened concurrently, we have to transform the indexes or positions at which those operations take
place. This algorithm does work, but it requires one very particular assumption: the fundamental assumption in most operational transformation systems is that all of the communication goes via a single central server. In Google Docs' case that server is provided by Google, and the server takes a really key role in this algorithm by sequencing all of the operations. The assumption of operational transformation is that, because we have this central server that is required to sequence all of the operations, there cannot be any other communication channels in the system. So even if two users who are collaborating are actually sitting in the same room, and they could just be using the local Wi-Fi or even Bluetooth to communicate their changes between their two devices, using that local communication channel is not allowed, because it would undermine the assumptions made by the operational transformation algorithm. So OT requires all of the communication to go via the central server, even if that server is on a different continent, and even if the two users are actually sitting in the same room. This is where CRDTs come in.
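As a rough illustration of the transformation step just described, here is a minimal sketch in Python. This is my own simplification for illustration, restricted to concurrent insertions only (real OT systems also transform deletions, and Google Docs' actual implementation is far more involved); the tie-break on node IDs is one common convention.

```python
def transform_insert(pos, other_pos, my_id, other_id):
    """Transform the index of an insertion against a concurrent insertion.

    If the concurrent insertion happened at an earlier (or equal) index,
    our own index must shift right by one so it still refers to the
    intended position. On an equal index, node IDs break the tie.
    """
    if other_pos < pos or (other_pos == pos and other_id < my_id):
        return pos + 1
    return pos

# Left user inserts "L" at 3 into "HELO"; right user inserts "!" at 4.
left = transform_insert(3, 4, "left", "right")    # stays at 3
right = transform_insert(4, 3, "right", "left")   # shifted to 5
```

Applying the transformed operations on both sides yields "HELLO!" for both users, which is the convergence property the algorithm is after.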
CRDTs allow this kind of multi-user collaboration without making any assumptions about the network topology. They don't assume anything about servers or about what kind of network is used: any type of communication channel can be used to communicate the operations or updates from one user to another. The main correctness criterion we have, for both OT and CRDTs, is what we call convergence. What we require is that whenever any two users have seen the same set of operations, those users must be in the same state, and this holds even if the users actually saw the operations in a different order. What we do in CRDTs is make the operations commutative, which means that even if their order is swapped around, the end result remains the same, and this is how we achieve convergence. It's just ensuring that after everyone has communicated, everyone has the same document on their screen, and this makes sense: it's of course a very useful property, because if we did not have it we would have some kind of permanent inconsistency between the users, and that would be bad. So convergence is clearly a minimum requirement for any of this sort of collaboration software. But as we shall see in this talk, convergence by itself is not really enough, because convergence doesn't say anything about what that final state is after the nodes have communicated. Is that final merged state actually the state that we wanted? This is a little bit more difficult to define, because what state is desirable or undesirable is really a matter of human judgment: what are our expectations, as humans, of how the software should behave? Unfortunately, it turns out that several of these CRDT algorithms for collaborative editing sometimes have behavior that is not really desirable; they behave weirdly in ways that we as humans would not expect. I have come to the conclusion, after several years of working on CRDTs, that basically these problems are quite hard. A simple version of CRDTs is very easy to implement, but actually getting it right, in such a way that it satisfies user expectations, is difficult. CRDTs are easy to implement badly, and what we shall look at in this talk is how we can implement CRDTs well, in a way that actually behaves as we want. Today I want to look at these four topics. These are all areas in which CRDTs are difficult in some way or another, and for most of them we have solutions as well. I'm going to talk through those one by one, and the
one that I want to start with is interleaving. Interleaving is a problem that can happen in collaborative text editors if several people insert text at the same position in the document. In order to explain why this interleaving happens, I'm going to first give a little introduction as to how CRDTs for text editing work, and I'm going to start with an example. As I said earlier, when we were looking at operational transformation, we identify the position where we want to insert or delete a character by its index, simply by counting from the start of the document, and that has the problem that those indexes need to be transformed. CRDTs avoid the need for this kind of operational transformation by taking a different approach: CRDTs create a unique identifier for every single character in the document, and that unique identifier remains stable; it stays the same even if stuff is added and deleted elsewhere in the document. Each character keeps its own permanent unique identifier. There are several possible ways of constructing such identifiers. One way is to choose a number between 0 and 1, some kind of fractional number, a rational number, and we say, for example, that 0 is the beginning of the document and 1 is the end of the document, and for every character somewhere within the text document we just pick a number that is appropriately positioned within that number line. Here, in our example of "HELO", we could say: okay, let's assign the number 0.2 to the H, the number 0.4 to the E, the number 0.6 to the L, and the number 0.8 to the O, and this way we have spaced out our characters over our 0-to-1 number interval. Now, when we want to insert a second letter L and an exclamation mark, we pick numbers in between the existing ones. We want to insert a second L in between the first L and the O, so in between 0.6 and 0.8: okay, let's pick 0.7 as the midpoint. Likewise, for the exclamation mark we'd pick a number between 0.8 and 1.0, so we pick 0.9, for example.
So this is a simple enough scheme, and we can use it to implement a CRDT in quite a simple way. What we can now do is think of our text document as a set of triples. Each of these triples contains the character at a particular position; it contains the numeric value that tells us where this character goes; and then, finally, the ID that I've put in these triples is just a way of referring to the node that created this particular insertion operation. It's just some kind of unique identifier for a particular user or a particular node in the system, and the reason we have it is just in case two different nodes pick the same number: then we can define the ordering on these characters using the node ID as a tiebreaker. So we now just have this set of triples representing the document, and we can make edits: we can insert new characters into this text document just by picking numbers for them and then calculating the set union, and we get a set including the new triples. We can reconstruct exactly what the state of the document is by sorting those triples by their numbers, tiebreaking based on node IDs, and then we get the text document. Several CRDTs for text take essentially this approach.
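The scheme just described can be sketched in a few lines of Python. This is a toy model for illustration only: real implementations use tree paths or timestamps rather than floating-point numbers, which would run out of precision.

```python
# A document is a set of (position, node_id, character) triples.
# Sorting by (position, node_id) recovers the text; merging two
# replicas is just set union, so merging is commutative.

def insert(doc, pos, node_id, char):
    return doc | {(pos, node_id, char)}

def render(doc):
    return "".join(ch for _, _, ch in sorted(doc))

doc = {(0.2, "a", "H"), (0.4, "a", "E"), (0.6, "a", "L"), (0.8, "a", "O")}
doc = insert(doc, 0.7, "a", "L")   # between the L (0.6) and the O (0.8)
doc = insert(doc, 0.9, "b", "!")   # between the O (0.8) and the end (1.0)
print(render(doc))                 # HELLO!
```

Because the state is a set and merging is set union, any two replicas that have seen the same insertions render the same text, regardless of the order in which the insertions arrived.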
They tend to not actually use numbers between 0 and 1; instead they use things like a path through a tree, but the effective idea is still the same: they are essentially picking positions on a number line. When we implement a CRDT in this way, we can get some unfortunate behavior, and this is an example of behavior that can occur in some text-editing CRDTs. Here we have two users. On the left-hand side, user 1 starts off with the text document saying "Hello!" and inserts the word "Alice" between "Hello" and the exclamation mark. Concurrently, on the right-hand side, user 2 inserts the word "Charlie" between "Hello" and the exclamation mark. Now the two users communicate, they merge their changes, and what we get out at the end is "Hello" followed by some kind of weird jumble of letters that we can't read. What does this even mean? How did this happen? We started with "Alice" and "Charlie" and we ended up with a mess of letters. Well, what happened is this. We started off with a text document where each character has a number on the number line: the O had 0.64 and the exclamation mark had 0.95. The insertion of " Alice" just chose some numbers in the interval between 0.64 and 0.95, and likewise the "Charlie" insertion independently chose some numbers within that interval. What a lot of CRDTs do here is actually randomize the numbers a little bit, because that reduces the probability that two different nodes will pick exactly the same number for a particular character, so it reduces the risk of collisions. What has now happened with this insertion is that we have taken all of the numbers from the "Alice" letters, which are colored in red here, and the insertions of the "Charlie" letters, which are in blue, and we simply ordered them based on the numbers. The result, unfortunately, is that we have interleaved those two character sequences that came from the two different users, and the result is a complete unreadable mess. This problem is bad enough if we're just inserting a single word, as we are here; it gets even worse if the two users have concurrently inserted an entire paragraph, or maybe an entire section, into their text document. If you have a paragraph or a section in which the characters have been interleaved on a basically random basis, you will not be able to use that text any more; you will just have to delete it and rewrite it, and the users are not going to be happy about this.
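To see how the interleaving arises, here is a small Python sketch of the scenario above, building on the set-of-triples representation. The positions here are evenly spaced rather than randomized (a deterministic stand-in for the random choices the talk mentions, so the outcome is reproducible); the interval endpoints are the 0.64 and 0.95 from the example.

```python
# Two users concurrently insert a word into the same gap (0.64, 0.95).
# Each spaces its characters evenly through the gap, as if typing alone.

def evenly_spaced(lo, hi, text, node_id):
    n = len(text)
    return [(lo + (hi - lo) * (k + 1) / (n + 1), node_id, c)
            for k, c in enumerate(text)]

alice = evenly_spaced(0.64, 0.95, " Alice", "u1")
charlie = evenly_spaced(0.64, 0.95, " Charlie", "u2")

# Merging is set union; rendering sorts by (position, node_id).
merged = "".join(c for _, _, c in sorted(alice + charlie))
print(repr(merged))   # '  CAhlarilciee' -- the two words are interleaved
```

Neither user did anything wrong: each chose perfectly valid positions within the gap, but because the two sequences of positions fall into the same interval, sorting shuffles the words together.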
Unfortunately, this behavior really does occur in real CRDTs. We have examined a number of different CRDT algorithms for text editing, and we have demonstrated this problem at least in the algorithms called Logoot and LSEQ. In these algorithms, unfortunately, the interleaving problem is very deeply baked into the fundamental way the algorithm is constructed. I cannot see any way of fixing this problem, of fixing this bug in the algorithms, because the way these numbers are assigned, or the way the paths through the trees are assigned, means the interleaving is inherent in the way these algorithms work. In other algorithms we have not found this problem: in Treedoc and WOOT, for example, we did not find this interleaving problem. That does not entirely guarantee that it can never happen, but we think it won't happen, and so in a way these algorithms are safe. However, unfortunately, these two algorithms are also the ones that are least efficient. The entire reason why the LSEQ algorithm was designed, for example, is that the designers wanted a CRDT that was more efficient than Treedoc and WOOT, because Treedoc and WOOT unfortunately take a lot of space for their identifiers. So, unfortunately, this performance optimization has introduced bad behavior in the algorithm, and in my opinion that makes algorithms like LSEQ and Logoot basically unsuitable for use in many text-editing applications. Moreover, there is a specification of collaborative text editing called Astrong, which was published by Attiya and others in 2016, and this specification also allows interleaving as a valid behavior, though we have a paper in which we fix this specification
to rule out this interleaving behavior. Finally, RGA is an interesting one. RGA is another text-editing CRDT algorithm, which does not quite have this same interleaving problem; it has a bit less of an interleaving problem, and I will mention it briefly because it's an interesting case study. In RGA, the interleaving problem
that can happen is like this. In this example here, on the left-hand side, we've got user 1, who starts off with the document saying "Hello!". The user inserts the word "reader" between "Hello" and the exclamation mark, then moves their cursor back to just after the word "Hello" and types the word "dear". So what we have here is typing the word "reader", moving the cursor back, and then typing the word "dear". On the right-hand side we have user 2, who just typed the word "Alice" between "Hello" and the exclamation mark. What can now happen in RGA is that the word "Alice" gets interleaved in between the word "dear" and the word "reader". So even though the text "dear reader" was typed by one user and the word "Alice" was typed by the other user, we can end up with the two mixed up a bit. We don't get the same character-by-character interleaving as with some of the other CRDTs, but we do still have the possibility of text being inserted into the middle of a different user's insertion. The reason this happens is that the data structure RGA uses looks a bit like this: it's a kind of tree data structure, usually with very deep subtrees and a small branching factor. Each node in this tree is a character that was inserted, and the parent of each node is the character that was its immediate predecessor at the time of insertion. We start off with a single special node at the root, which is the beginning of the document, and then when the user types h-e-l-l-o and an exclamation mark, each of those letters has as its predecessor the character that was just typed, and so you end up with this chain of h-e-l-l-o, almost like a linked list, where each character's parent is just its immediate predecessor in the tree. Then the users come and position their cursors in between "Hello" and the exclamation mark: one types " reader", another types " Alice", and the first user, who typed "reader", moves their cursor back and types "dear". You can see that every time the cursor gets moved, that results in a new subtree being formed here. Otherwise, the order of the characters in the document is a depth-first traversal over this tree, and we can see that in the interleaving that has happened here: "Hello dear Alice reader". What has happened is we visit "Hello", then first the "dear" subtree, then the "Alice" subtree, then the "reader" subtree, and then the exclamation mark. The order in which we visit those subtrees is defined by a timestamp that is attached to every node in the tree: if, as in this case, one node has several children, then we visit those children in descending order of timestamp. Because the timestamps are unpredictable in concurrent editing cases, you can end up with these three possible outcomes as the merge result.
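The depth-first, timestamp-ordered traversal just described can be sketched as follows. This is a simplified model of RGA's tree (integer timestamps chosen by hand for the example; real implementations use Lamport timestamps and handle deletions, which are omitted here).

```python
def render(children):
    """Depth-first traversal, visiting siblings in descending timestamp order."""
    text = ""
    for ts, ch, kids in sorted(children, key=lambda n: n[0], reverse=True):
        text += ch
        text += render(kids)
    return text

def chain(ts0, text, last_children=()):
    """Build a linked-list-like chain of (timestamp, char, children) nodes:
    each character is the single child of its predecessor; the final
    character gets last_children attached."""
    children = list(last_children)
    for offset, ch in reversed(list(enumerate(text))):
        children = [(ts0 + offset, ch, children)]
    return children

# Three subtrees hang off the 'o' of "Hello": the user typed " reader"
# first, then moved the cursor back and typed " dear"; concurrently the
# other user typed " Alice". Timestamps: "!" earliest, then reader,
# Alice, dear, so the traversal visits dear, Alice, reader, "!".
o_kids = (chain(10, " reader") + chain(20, " Alice") + chain(30, " dear")
          + chain(7, "!"))
doc = chain(2, "Hello", last_children=o_kids)
print(render(doc))   # Hello dear Alice reader!
```

With different relative timestamps among the three sibling subtrees you get the other possible merge outcomes, which is exactly the unpredictability described above.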
So it's not as bad as the character-by-character interleaving that you get with the others. I should be precise: it is possible to get character-by-character interleaving in RGA. The circumstance in which that happens is if the users type their document back to front: if the users start at the end of the document and first type the last letter, then move their cursor back to the beginning and type the second-to-last letter, move the cursor, type the third-to-last letter, and so on, typing the entire document from the end to the beginning. If they do that, then they would actually get character-by-character interleaving in RGA, because in that case, in this tree, all of the characters would be siblings: they would all be children of the root node, and in that case the ordering is just defined by timestamps again. But fortunately, in practice, people tend not to type documents like that: people tend to type documents in a reasonably linear fashion from beginning to end. Of course, occasionally people will backtrack a bit, delete a bit, copy and paste some sections, and so on, but as a general rule, typing still happens in a fashion from beginning to end, and for that reason the interleaving problem is not actually as bad in RGA as it is in the other algorithms. However, we could still say we want to be strict and not allow any interleaving at all, and we did actually devise an algorithm: for RGA we were able to find a fix, a modification to the algorithm, which changes it such that this kind of interleaving of concurrent insertions does not happen, regardless of how the cursors were moved. I don't have time in this talk to go into that particular algorithm, but if you're interested there's a paper we published in 2019 called "Interleaving anomalies in collaborative text editors", and in this paper the algorithm is described exactly. So let's
move on to our next topic. Our next topic today is reordering list items: we have a list of things, and we want to move an item from one position in the list to another. One example application in which we might want this is a to-do list. In a to-do list you can often drag and drop items into whatever order you want, and so what happens here, for example, is you have a to-do list consisting of, firstly, "buy milk", secondly, "water the plants", thirdly, "phone Joe", and the user is dragging and dropping the "phone Joe" item and moving it to the top of the list, so that it is now at position 1. How do we do this? Well, we can represent the list using any of the CRDTs we talked about earlier: all of the CRDTs for text editing are also CRDTs for lists, so we can use those to represent this ordered list of items. However, none of those CRDTs actually supports a move operation: all they have is operations for inserting and deleting items, but no operation for moving. Now, you might think, okay, that's not too bad, because we can always implement moving by just deleting the item from its old location and inserting it at its new location, and this works, except that it has a problem if several people concurrently move the same item. In this example here, replica A and replica B both move "phone Joe" to the top of the list. What happens if we implement this move by deleting and reinserting? Well, they both delete "phone Joe" from position 3, so it's definitely gone from position 3, and then they both insert "phone Joe" at position 1. But the result now is that these two insertions at position 1 both happen, and so we end up with two copies of "phone Joe", at positions 1 and 2. This item that we only wanted to move has accidentally got duplicated in the process, and this is really not the kind of behavior that a user would expect, I think. In a to-do list application, just because a user reordered some items on their list doesn't mean they suddenly want two copies of those items appearing.
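The duplication problem is easy to reproduce with the toy set-of-triples list from earlier. This is an illustrative sketch, not any particular library's API; note also that a real list CRDT would record deletions with tombstones, which I've glossed over since both replicas happen to delete the same element here.

```python
# List elements are (position, node_id, value) triples; the list state of
# a replica is a set, and merging two replicas is set union.

def move_by_delete_insert(state, elem, new_pos, node_id):
    """'Move' elem by removing it and inserting a fresh copy at new_pos."""
    return (state - {elem}) | {(new_pos, node_id, elem[2])}

shared = {(0.2, "x", "buy milk"), (0.4, "x", "water plants"),
          (0.6, "x", "phone joe")}
joe = (0.6, "x", "phone joe")

# Replicas A and B both concurrently move "phone joe" to the head.
a = move_by_delete_insert(shared, joe, 0.1, "a")
b = move_by_delete_insert(shared, joe, 0.15, "b")

merged = a | b   # the deletions agree, but both insertions survive
print([v for _, _, v in sorted(merged)])
# ['phone joe', 'phone joe', 'buy milk', 'water plants']
```

The two freshly inserted copies have different identifiers, so from the CRDT's point of view they are simply two independent insertions, and both survive the merge.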
So what is the behavior we actually want? What is the desirable behavior? Let's look at a similar example. In this case we've also got two replicas moving "phone Joe", but they are moving the item to different positions: replica A moves "phone Joe" to the top of the list, whereas replica B moves "phone Joe" to the second position, after "buy milk". What is the desirable behavior in this case? Well, we could say that "phone Joe" should now appear in both positions, before "buy milk" and after "buy milk", but then we've got duplication again, and we did not want duplication. So really what we want is for "phone Joe" to appear in one of the two positions. Maybe we show a warning saying "hey, user, we picked one of the two positions", or maybe it's just fine to pick one of the two and forget that there were two different list positions. So I'm just going to say: we're going to pick one of the two concurrent positions, arbitrarily but deterministically, as the winner. In this example here, we've got "phone Joe" at the top of the list as the winner, so from the point of view of the first user nothing changes, but from the point of view of the second user, after they've moved "phone Joe" to position 2, the "phone Joe" is then subsequently moved to position 1. Now let's think about what is happening here. We've got two concurrently written values for the two different positions, and we want to pick one of them, arbitrarily but deterministically, as the result, as the winner, as the outcome. This may look familiar to you if you have studied existing CRDTs, because it's just what happens in a last-writer-wins register. In a last-writer-wins register we've got a register, which is effectively a variable, which can concurrently have a value assigned to it by different users, and if we have those concurrent assignments, the end result after everybody has merged their state is that we pick one of the written values as the winner, and we throw away any other values that were written concurrently. So you can think of the position of the list item "phone Joe" as having last-writer-wins register semantics: first one user assigns the position of "phone Joe" to be the head of the list, the other user assigns the position of "phone Joe" to be after "buy milk", and then, as the two users communicate, the end result is that the winner is one of those two values, either head of the list or after "buy milk". So let's make this a
little bit more formal. If we want to implement this, we need, first, a last-writer-wins register. Second, we need some kind of unambiguous way of referencing positions in the list. But if you think about what we did earlier, when we were talking about unique identifiers for the characters in a text document, what these list CRDTs and text CRDTs are already doing is providing us with a stable and unique way of referencing particular positions in a list. So we can just use one of these existing algorithms and use its unique identifiers for list elements in order to reference the position of a particular item in the list. Different CRDTs have different ways of constructing those identifiers, as we saw: Treedoc uses a path through a binary tree, Logoot uses a path through a tree with a higher branching factor, and RGA and Causal Trees use some kind of timestamps. But the basic idea is still the same: we've got unique identifiers for particular positions in the list, and so we can take those identifiers and put them as the value inside our last-writer-wins register. Secondly, since we've got several list items, we need a separate register for each list item to hold the position of that particular item. So we can construct a set, for example using an add-wins set CRDT, and the contents of that set are going to be all of the items on our list. Each item on the list is going to be a pair consisting of a value, which is, say, the text of the to-do list item, and a last-writer-wins register containing the position of that particular list item. Now we can add new items to the list by adding them to the add-wins set, and we can move an item from one position to another by updating the value of its last-writer-wins register. In particular, we use a list CRDT to create a position identifier for the location where we want to move a particular item, and then we take that position identifier and use it as the value in the last-writer-wins register. So here we've constructed a new CRDT operation, a move operation for lists, just by using three existing CRDTs: any of this range of different list CRDTs, plus an add-wins set, plus a last-writer-wins register. And because we've just composed existing CRDTs, we know that the end result is also going to be a CRDT. This is very nice, and it allows us to do atomic moves of a single list item at a time.
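Putting the pieces together, here is a toy sketch of the composed design. This is my own illustrative code, not a real CRDT library: positions are the fractional numbers from earlier, the add-wins set is simplified to a plain dict keyed by item, and the last-writer-wins register is resolved by comparing a (timestamp, node_id) pair.

```python
# Each to-do item maps to a LWW register holding its position:
#   items[item] = (timestamp, node_id, position)
# A move just writes a new position; on merge, the higher
# (timestamp, node_id) pair wins, so concurrent moves never duplicate.

def move(items, item, new_pos, ts, node):
    updated = dict(items)
    updated[item] = (ts, node, new_pos)
    return updated

def merge(x, y):
    return {k: max(filter(None, (x.get(k), y.get(k))))
            for k in x.keys() | y.keys()}

shared = {"buy milk": (1, "x", 0.2),
          "water plants": (1, "x", 0.4),
          "phone joe": (1, "x", 0.6)}

a = move(shared, "phone joe", 0.1, ts=2, node="a")   # to the head
b = move(shared, "phone joe", 0.3, ts=2, node="b")   # after buy milk

m = merge(a, b)
order = sorted(m, key=lambda k: m[k][2])
print(order)   # ['buy milk', 'phone joe', 'water plants']
```

Here the timestamps tie, so the node-ID comparison makes replica B's write win deterministically: "phone Joe" ends up after "buy milk" on both replicas, with exactly one copy.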
Now, things get a little bit more difficult if we want to move more than one item at a time. In a to-do list you really only need to move one item at a time, because drag-and-drop typically lets you pick up one item at a time in the user interface, so that's not really a problem. But in text editing you could imagine other circumstances arising. In this example here we've got a to-do list, this time represented just as plain text, as a list of bullet points, where each item starts with a bullet point character and ends with a newline character. Here, replica B is taking the item "milk" and wants to move it in front of "bacon", and that means taking all of the characters (the bullet point, the space, m, i, l, k, and the newline) and moving that whole range to be in front of "bacon". Concurrently with that happening, a different user is updating the "milk" item to say "soy milk", and they do that by deleting the uppercase M and inserting the text "soy" and a lowercase m. Okay, so what do we expect to happen here? Firstly, the change from "milk" to "soy milk" takes effect, and the move of "milk" to be in front of "bacon" takes effect, so the desired outcome is that we have "soy milk" first in the list, in front of "bacon". That means the two changes have been merged together cleanly. This is what we want to happen; unfortunately, it's not what actually happens. If we look at the algorithm that we just constructed for moving a single list item, and we just move each character individually according to this algorithm, the end result we get is this: "milk" is correctly moved in front of "bacon", which happens exactly as we wanted, but the change from "milk" to "soy milk" remains attached to the old position where "milk" was previously, not the new position where "milk" has been moved to. So the end result is that the list reads "milk", newline, "bacon", newline, "soy m", and that "soy m" is just standing there without any context, because its context has been moved away in the meantime. This is a problem that I do not know how to solve for the time being. I've spent a while thinking about it and come up with a couple of half-solutions that don't really work; a couple of other people are thinking about it as well, so if you're interested in this, feel free to think about it, and do publish the result if you manage to figure it out. So this is the problem of moving items in lists: we've figured out how to move a single list item at a time, but moving ranges of characters in a way that behaves nicely is still an open challenge. Let's move on to point three, which is
again about moving, but now rather than moving elements in lists, we're going to move subtrees in a tree. Let's again look at an example. Say we have a tree structure, which might be, say, a file system, and like in the case of moving elements in lists, we can think about what happens if the same node in a tree is concurrently moved by two different users to two different locations. You can see here we start off with the same tree on both replicas, where A, B and C are children of the root. On replica 1 we take node A and move it to be a child of B, whereas on replica 2 we take the same node A and move it to be a child of C. So now A has been moved to two different locations on two different
replicas. In this case, what do we expect the outcome to be? It's kind of similar to the case of moving elements in lists. One option is to duplicate it, so that we have A appearing as a child of B and also A appearing as a child of C. This means that any children of A will also have to be duplicated, and that's what you get here in the box labeled A: you have the A subtree with children A1, A2, A3, and then another whole copy of that subtree. So this is duplication, and like in the case of the lists, I think duplicating nodes on concurrent moves is not a good behavior, so we don't want that. Another option is for A to actually be shared as a common child of both B and C. This works, but it's no longer a tree. If what we expect our data structure to be is a tree, rather than a DAG or some other kind of graph, then this is not an acceptable outcome, because here A has two parents, and in a tree no node ever has two parents. So
that leaves C and D as the possible outcomes. C is: we simply take the replica 1 operation as our outcome and ignore the move made by replica 2. Or, vice versa, we can choose replica 2 to be the winner and ignore the move made by replica 1. In this case, either A appears as a child of B, or A appears as a child of C, but not both. So, like in the case of moving items in lists, what we have is a kind of last-writer-wins behavior here, and I think that is reasonable. Now, in trees, additional complications appear, and so
trees are definitely a harder case than lists. Another example of tricky things that can happen in trees can be illustrated with your file system. If you want, you can try this on your computer: just create a directory called "a", then create a subdirectory within "a" called "b", and then try moving "a" into "b", that is, move "a" to be a child of "a/b". This may sound weird; what you are doing here is effectively trying to
move a directory into itself. I don't know if you've ever tried this; I did try it on macOS, and I get this response here from the shell saying "invalid argument". I'm not allowed to do this move, and this makes sense, because if we did perform a move where we move a directory into itself, we would be creating a cycle, and then our data structure would no longer be a tree: it would be some kind of graph with cycles in it, and this would be very confusing for a file system.
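If you want to try this yourself, here is a quick sketch (the directory names are arbitrary, and the exact error text varies by operating system):

```shell
mkdir -p a/b
mv a a/b   # attempt to move a directory into its own subdirectory
# the OS refuses: macOS reports "Invalid argument", while GNU
# coreutils says "cannot move 'a' to a subdirectory of itself"
```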
So if we build CRDTs for trees, it makes sense that if we allow moving subtrees, we also prevent this kind of cycle and prevent somebody moving an object inside itself. But unfortunately, with CRDTs we have the additional complication of concurrent changes. So consider this example, where we start off with the same tree again on
both replicas. On replica 1 we take the node B and move it to be a child of A, so here B appears as a child of A, while C remains as an existing child of A. On replica 2 we start off with the same tree, and in this case we move A to be a child of B, so we end up with A as a child of B, and C as a child of A as before. What happens now if we merge these two? We've got two move operations, each of which by itself is perfectly fine: there's nothing wrong with moving B to be a child of A, and there's nothing wrong with moving A to be a child of B. But if we combine both of these, we end up with exactly the same problem as we just had: effectively, one directory gets moved inside itself. And so, if we are not careful with this algorithm, we end up with A and B in this kind of cycle, disconnected from the root of the tree, and this is again no longer a tree, it's some kind of more general graph data structure. So we don't
want this if we want to preserve the fact that our data structure is a tree. Another option, as before, is that we can duplicate nodes: we can say that we move B to be a child of A, and we move A to be a child of B, so we have both, B as a child of A and A as a child of B, simply existing in our tree, and again we would have to duplicate the children of those nodes. As before, I think duplication is the wrong way of doing this. So it seems that what we really want is last-writer-wins semantics here, where either we choose the outcome of moving B into A and ignore the other move, or we choose the outcome of moving A into B and ignore the other move. In this case, we want to pick one of the two conflicting operations and just ignore the other one, but we have to ensure that all of the replicas end up picking the same operation as the winner and ignoring the same other operation, because otherwise they won't converge, and then it's not a CRDT. So how do we actually achieve this, where of these
conflicting operations we want to ensure that exactly one gets picked? Well, I don't know, we can try some existing software. I tried this with Google Drive, for example, and I got this wonderful error message. I tried exactly this: having two directories A and B, move A into B and move B into A, concurrently on two different computers. What happens? I get some kind of "internal unknown error, try again", but this error never goes away; it just keeps sitting there in a loop, trying to sync. So Google Drive
hasn't solved this problem either. Okay, I guess we'll have to think about it for ourselves. The way that we solve this is by thinking about the sequence of operations that get applied on each replica, and having a timestamp on each operation. We're just going to assume some globally unique timestamps, as usual; you can use Lamport timestamps, for example, and that works fine. Of these operations, some are going to be move operations and some might be other kinds of operations.
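As a reminder of how such globally unique, totally ordered timestamps can be built, here is a minimal Lamport timestamp sketch (the field and function names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Lamport:
    counter: int
    actor: str  # globally unique actor ID; breaks ties between counters

def receive(local: Lamport, remote: Lamport) -> Lamport:
    # on receiving an operation, advance the local counter past the
    # highest counter seen, so causally later operations sort later
    return Lamport(max(local.counter, remote.counter) + 1, local.actor)
```

Because `order=True` compares the `(counter, actor)` pairs lexicographically, any two distinct timestamps are strictly ordered, which is all the move algorithm below needs.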
Here we have, on replica 1, moving A into B, and on replica 2, moving B into A, and as I said before, each of these operations by itself is safe and fine to execute. What we need to ensure is that we detect the case when two operations are in conflict with each other, in the sense that executing both of the operations would create a cycle; we have to ensure that no matter what we do, we do not create a cycle. Now, what can happen when these two replicas merge their sequences of operations? Well, we take the union of the operations from both of the replicas, and we can put them in timestamp order. And now, if we say that the operations get executed exactly in increasing timestamp order, then the first move is fine, because by that point nothing bad has happened yet, but by the time we get to the second move operation, that move is unsafe. So if we want to ensure that there are no cycles in our tree, we have to skip that operation and just pretend it didn't happen. This is a bit tricky, because replica 2 has already executed this operation, so replica 2
will somehow have to undo this operation that it has already performed, in order to ensure that it converges again with replica 1. So the basic principle we're going to use is to take these sequences of operations, where we have a timestamp on each operation, and merge them such that we get a single sequence of operations in timestamp order. For example, here on replica 1 we have timestamps 1, 3, 4 and 5, and on replica 2 we have timestamps 2, 6, 7 and 8. Imagine you're replica 1, and you've executed 1, 3, 4, 5, and now you receive the operation with timestamp 2 from replica 2. What you need to do is take this timestamp-2 operation and insert it somewhere earlier into the sequence of operations that you've executed. But time has already moved on by this point, and we've already executed the operations with timestamps 3, 4 and 5. So how do we handle this? Well, we can just take a
fairly simple approach, which is to undo and redo operations, and this is exactly what we do in our algorithm. Say you are replica 1, and you want to apply the operation with timestamp 2: you've got your log of operations that you've performed, with timestamps 1, 3, 4 and 5. We're first going to undo operations until we've undone all of the operations with timestamps greater than the new operation: we undo op 5, then op 4, then op 3. Now we're only left with op 1, so we can apply op 2 on top of that, and then we redo the three operations that we had undone. The end result is that we have a log of operations where op 2 has been inserted in the right place. And if we keep performing this kind of logic on each of the replicas, then this will ensure that all of the replicas end up with the same operations executed in the same order.
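The undo/redo merging just described can be sketched generically like this. This is a simplification for illustration, not the paper's exact formulation: `do` applies an operation to the state and returns an undo record, and `undo` reverses one:

```python
def integrate(log, state, new_op, do, undo):
    """Keep `state` as if all operations had been applied in increasing
    timestamp order. `log` holds (timestamp, op, undo_record) tuples in
    timestamp order; `new_op` is a (timestamp, op) pair."""
    undone = []
    # undo every operation whose timestamp is greater than the new op's
    while log and log[-1][0] > new_op[0]:
        ts, op, record = log.pop()
        undo(state, record)
        undone.append((ts, op))
    # apply the new operation in its correct place in the order...
    log.append((new_op[0], new_op[1], do(state, new_op[1])))
    # ...then redo the operations we rolled back
    while undone:
        ts, op = undone.pop()
        log.append((ts, op, do(state, op)))
```

Every replica that runs `integrate` over the same set of timestamped operations ends up with the same log order, and hence the same state, regardless of the order in which the operations arrived.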
This just means we have a bit of extra work to do, because if some operation needs to be inserted somewhere quite early in the order, we have to do a lot of undos for that one operation, and then a lot of redos to replay all of the operations that happened afterwards. So you might wonder: isn't the
performance going to be terrible? Well, we did some performance experiments, and it was okay. There was certainly a cost to performing all of these undos and redos, and you can see that here: we set up a system with three replicas on three different continents, so there was a delay of over a hundred milliseconds round-trip time between these replicas, and we then started generating move operations at a high rate. The higher the rate of move operations in this system, the more undo and redo needs to be done, and so the slower it becomes to execute each individual operation. We were able to get this system to at least 600 operations per second, which is not massive on big-data scales, but on the other hand, a single user is not going to perform more than 600 move operations per second on their own local file system. So for a lot of collaboration software, this performance is perfectly fine, and we don't have to worry too much. So I will explain a little bit more about how
this algorithm works in terms of the data structures that we use. We first have to find a way of describing the move operation that we want to perform. We're going to say a move operation is a structure with the following fields. First of all, like I said earlier, a move operation has a globally unique timestamp, such as a Lamport timestamp. Then a move operation has got to have a child, which is the node in the tree that we are moving, and a parent, which is the new location, the new parent of that child in the tree. Note that we are not going to record in this operation what the old parent of the child was, so we can call this a "stealing" move: we're simply going to take the child from wherever it currently is in the tree and move it to be a child of this new parent. And further, we can have a bit of metadata associated with the operation: for example, in a file system you have a file within a directory, and the file has a name within that directory, and that name can be the metadata that is associated
with a particular file in a particular directory. Then we have to construct the log of operations, in order to perform the undo and redo, and in order to perform the undo we need to associate a little bit of extra data with each entry in the log. Each entry has several fields that come straight from the move operation: the timestamp, the new parent, the new metadata, and the child all come directly from the move operation. In addition to these, we now also record the old parent and the old metadata, that is, what the parent of the child was before we performed this move, and what its old file name was, for example, if we're using the metadata for file names. This now allows us to perform the undo, because to undo the operation we just set the child's parent and metadata back to its old parent and its old metadata. And given this log of
move-operation log entries, we can now construct the current state of the tree. The tree is just a set of (parent, metadata, child) triples, one triple for each parent-child relationship in the tree. We can now define the ancestor relation: we say that a is an ancestor of b if either a is a direct parent of b, that is, a triple (a, m, b) exists in the tree, or there exists some tree node c such that a is a parent of c and c is an ancestor of b. So this just defines the transitive closure of the parent-child relationship.
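Under that triple representation, the ancestor relation can be sketched as a plain iterative version of the recursive definition (the function name is illustrative):

```python
def is_ancestor(tree, a, b):
    """tree is a set of (parent, meta, child) triples. Return True if
    `a` is an ancestor of `b`, i.e. `a` is reachable from `b` by
    repeatedly following child -> parent links. This terminates
    because a well-formed tree contains no cycles."""
    frontier = {b}
    while frontier:
        # step one level up: the parents of every node in the frontier
        frontier = {p for (p, _, c) in tree if c in frontier}
        if a in frontier:
            return True
    return False
```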
And given this definition of ancestry, we can now define what it means for a move operation to be safe. For a move operation that we want to perform, we look at the child that is being moved and the parent that is the destination the child is being moved to. If the child is currently already an ancestor of that destination, then the move would be unsafe, as we would create a cycle, so we do nothing. Similarly, if the child and the parent are in fact the same node, that would also be unsafe, so again we do nothing. But in all other cases it's fine, so we can update the tree by removing the existing location of the child, that is, removing any (p, m, child) triple for any p and m (this is exactly what I said: we steal the child node from wherever it currently is in the tree), and then adding the new (parent, metadata, child) triple to the tree. This defines our updated state of the tree.
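Putting the pieces together, here is a hedged sketch of the move operation and its undo. For compactness it represents the tree as a map from child to (parent, metadata), which is equivalent to the triple set because every node has at most one parent; the names are illustrative, not the paper's notation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoveOp:
    time: int    # globally unique timestamp (an int here for simplicity)
    parent: str  # new parent for `child`
    meta: str    # e.g. the file name under the new parent
    child: str   # node being moved, "stolen" from wherever it currently is

def _reaches(tree, a, b):
    # is `a` an ancestor of `b`? walk up from b via parent links
    node = b
    while node in tree:
        node = tree[node][0]
        if node == a:
            return True
    return False

def apply_move(tree, op):
    """Apply op if safe; return a log entry (op, old, skipped), where
    `old` is the child's previous (parent, meta), so the move can be
    undone later during undo/redo reordering."""
    # unsafe: moving a node under itself or under one of its descendants
    unsafe = op.child == op.parent or _reaches(tree, op.child, op.parent)
    old = tree.get(op.child)
    if not unsafe:
        tree[op.child] = (op.parent, op.meta)
    return (op, old, unsafe)

def undo_move(tree, entry):
    op, old, unsafe = entry
    if unsafe:
        return  # a skipped move changed nothing, so there is nothing to undo
    if old is None:
        del tree[op.child]
    else:
        tree[op.child] = old
```

Running the concurrent example from above, moving B under A and then A under B, the second move is detected as unsafe and skipped, so the structure remains a tree.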
Given this function for performing a move operation, and given the undo/redo procedure that I explained previously, we now have an algorithm for safely performing move operations on a tree CRDT, and we proved several theorems about this algorithm. In particular, we showed that this algorithm does indeed preserve the tree structure. To preserve the tree structure, what we need is that every node has a unique parent and that the tree contains no cycles; that is the definition of a tree, and we proved that these things hold, so for any sequence of operations that is executed, the data structure remains a tree. And then, obviously, we have to prove that it's a CRDT, which means that it converges: given any two sequences of operations on this tree, if one sequence of operations is a permutation of the other, then applying those sequences leads to the same state; in other words, the operations are commutative. So this algorithm works, and we have implemented it and proved these properties about performing move operations on trees. Okay, moving on from moving subtrees in CRDTs, let's talk about efficiency, and making CRDTs more
efficient. One particular problem that occurs with a lot of CRDT algorithms, especially the text-editing CRDTs, is the overhead that the CRDT metadata incurs. If you think about a text document, what we said is that we give every single character in the document a unique identifier. So what you end up with is that each character of English text is one byte for the ASCII character, the UTF-8 encoding of that character, but then you have additional metadata: you might have, for example, some path through a tree, which might take a couple of dozen bytes to encode, and then some identifier for the node that inserted a particular character, which, if it is a UUID, will be at least another 16 bytes, or if you encode it in hexadecimal, more like 36 bytes. So very quickly you end up with this really disproportionate situation where you have one byte for the actual data that you want, and something like a hundred bytes of metadata, or even more than that, just for the CRDT metadata in order to make the CRDT work, and that's not a particularly great situation to be in. So what I would like to talk about now is some work that we've been doing as
part of the Automerge project, which is a CRDT implementation that I work on, in order to bring down that metadata overhead, and we've had some really successful results by using some carefully designed algorithms and data structures. Let me start with the results. The old version of Automerge uses a JSON encoding to encode all of the metadata, send it over the network, and write it to disk, and this JSON encoding is incredibly verbose. What we've done here, as an example, is take a particular case study, which is the editing history of the LaTeX source of a paper.
In order to generate this editing history, a colleague and I actually wrote this paper using a homegrown text editor, and so we were able to capture every single keystroke that went into the writing of this paper. The final file is about 100 kilobytes of plain text, just ASCII text containing LaTeX markup, and we captured over 300,000 operations, that is, 300,000 individual keystrokes of editing history going into it. That includes all of the individual characters that were typed in order to create this document, all of the characters that were deleted, any stuff that was added or removed, any typos that were corrected, and so on. We also recorded all of the cursor movements, so every single time the cursor was moved from one position in the document to a different one, that was
also recorded. And if you take this history of 300,000 or so changes and encode it in the current Automerge data format, this JSON data format, the file size is about 150 megabytes; that is, we're using almost 500 bytes per change, which is pretty terrible. And you can see that with the new binary data format that we've designed to bring down that metadata overhead, we've made some really big improvements. What we have done is, first of all, design a binary format that encodes the full editing history, including every single insertion and deletion, and including when it happened, and encodes that in a compact form. It doesn't lose any information at all; this is a completely lossless encoding. And what we've managed to do is reduce the file size by over a factor of 200, down to about 700 kilobytes, containing exactly the same information, just encoded differently. I should emphasize that if you just took the JSON history and encoded it using one of the binary equivalents of JSON, you would not get anywhere near this: if you use Protocol Buffers or MessagePack or something like that, you might reduce it by a factor of two or three at the most. But in order to get this factor-of-200 improvement, you have to design a data format that is specific to CRDTs and specific to the particular editing
patterns that tend to occur in these sorts of applications. Now, you can see that by gzipping these files you can reduce them: you can reduce the 150 megabytes down to about 6 megabytes, but those 6 megabytes are still huge compared with the 700 kilobytes that we achieved, uncompressed, in our binary format; and that still gzips down further, to about 300 kilobytes, without losing any information at all. So note that by this point we've been able to encode a change history consisting of over 300,000 changes in 300 kilobytes, so we're now actually using less than one byte per change to encode the full change history. We can look back at any past version of this document, we can diff any two versions at any points in time, and all of that data is still there, encoded in this very compact form, which I find very exciting. Now, of course, we can still compress it further by throwing away some of the history. One of the first things we might choose to throw away is the cursor movements, because, to be honest, who cares about exactly where the cursor was at which point in the editing history? So we can save about 22% by throwing away the cursor movements, and we can go down to about 230 kilobytes by also throwing away all of the change history for the text. By this point, where we're at 230
kilobytes, we still have all of the CRDT metadata, including all of the tombstones. This means that at this point we're still able to perform arbitrary merges between different replicas of this data, and yet we are down to only about twice the raw data set size: about 100 kilobytes is the raw text document with no CRDT metadata at all, just the final state of the text document, and all of the CRDT metadata is adding only a bit more than 100 percent overhead to this, without gzip. If we actually apply gzip, then the file sizes are almost the same: the file containing the CRDT metadata, gzipped, is about the same size as the plain text file, non-gzipped. Now, we can get even further down by removing the tombstones. The tombstones are the markers for deleted characters in the document, and if we remove them, we save, what is it, another 70 kilobytes or so. Whether you want to do that or not depends, because if you throw away the tombstones, it means you can no longer merge with edits that happened concurrently: any edits that happened before the tombstone removal, or concurrently with the tombstone removal, can't be merged any more. But if you do remove the tombstones, then the raw CRDT metadata, just the unique identifiers for the characters, takes only 50 kilobytes, so it's only about 48 percent overhead on top of the raw text document, using our data format. Very exciting, I find this. So I should tell you a bit about how this actually works, and how we were able to achieve
such good compression. The basic idea is, as I said, that we want to maintain the full change history as much as possible, so we're going to store the full set of all of the insertion operations, all of the deletion operations, and all of the cursor movements if necessary. Each operation is given a unique ID, which is just a Lamport timestamp, and a Lamport timestamp is just a pair of a counter and the ID of the node that generated it, which we call the actor ID in Automerge. The algorithm that we use for text editing is based on a CRDT where, whenever you perform an insertion, you reference the identifier, that is, the Lamport timestamp, of the character that is immediately before the inserted position. We keep this set of all of the operations stored in document order, that is, in the order in which
the characters appear in the document, and what we get is this kind of table here. You can think of this almost like a table in a relational database, where each row is one operation. What you see here is someone typing h-e-l-l-l-o, so it's "hello" with three L's, and then somebody came along and deleted one of the three L's, so that the final result is just "hello". The way this is encoded is that each row here is an insertion operation into this document, and each row has an operation ID, which is a Lamport timestamp, so it consists of a counter that just increments, and the actor ID. I just wrote "a" here, but in reality our actor IDs are actually UUIDs, so they're more like 16 bytes or even 32 bytes long. So we have this unique identifier for each character that was inserted. Then we have the reference element, which is the ID of the predecessor character; this is just saying that the "e" comes after the "h": the "h" has ID 1a, and the "e" has ID 2a and comes after 1a.
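As plain data, the "helllo" table just described might look like this (the column layout and the deleting actor's ID are illustrative, not Automerge's actual schema):

```python
# Each row: (op_id, ref, deleted_by, char). op_id and ref are
# (counter, actor) Lamport timestamps; ref identifies the character
# immediately before the insertion position (None for the start).
ops = [
    ((1, "a"), None,     None,     "h"),
    ((2, "a"), (1, "a"), None,     "e"),
    ((3, "a"), (2, "a"), None,     "l"),
    ((4, "a"), (3, "a"), (7, "b"), "l"),  # deleted later by actor b
    ((5, "a"), (4, "a"), None,     "l"),
    ((6, "a"), (5, "a"), None,     "o"),
]

# the visible text is just the undeleted characters in document order
text = "".join(ch for (_, _, deleted_by, ch) in ops if deleted_by is None)
```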
We can encode all of the operations this way. Furthermore, if an operation is marked as deleted, we do that by having these additional deletion columns, where we record the operation ID of the operation that deleted this particular character; in fact, it can happen that there are multiple deletion operations for the same character, if multiple users deleted the same character. And now we can take this table and encode it in a very compact form, using some fairly simple tricks: we'll
first encode each column of this table individually. For example, we can look at the first column, the counter column, which consists of the numbers one, two, three, four, five, six. How do we encode that efficiently? Well, first of all we delta-encode it, which means that for each number we calculate the difference between that number and its predecessor. For one, two, three, four, five, six, the difference between any two subsequent numbers is one, so this delta-encodes to 1, 1, 1, 1, 1, 1, which we can then run-length encode as "six times the number 1", and then we can use a variable-length binary encoding for integers to encode "6, 1" in two bytes. This variable-length encoding just ensures that small numbers get represented in one byte, slightly bigger numbers in two bytes, bigger numbers still in three bytes, and so on. So here we've now encoded this entire counter column in two bytes.
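A hedged sketch of those three steps, delta encoding, run-length encoding, and variable-length integers; this illustrates the principle, not Automerge's actual wire format (real formats also need a signed variant, such as ZigZag, for negative deltas):

```python
def delta_encode(nums):
    # difference between each number and its predecessor (first vs 0)
    prev, out = 0, []
    for n in nums:
        out.append(n - prev)
        prev = n
    return out

def run_length_encode(values):
    # collapse repeated values into (count, value) pairs
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            runs.append((1, v))
    return runs

def uvarint(n):
    # variable-length unsigned integer: 7 bits per byte, high bit = "more"
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

counters = [1, 2, 3, 4, 5, 6]
runs = run_length_encode(delta_encode(counters))            # [(6, 1)]
encoded = b"".join(uvarint(c) + uvarint(v) for c, v in runs)  # 2 bytes
```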
Similarly, we can proceed to the actor ID column. For this column, first of all, if we make a lookup table, we don't need to keep repeating the huge IDs: we just map "a" to 0 and "b" to 1, for example, and now this column translates into 0, 0, 0, 0, 0, 0, which we can run-length encode and then encode with the variable-length binary encoding again, and again we've encoded this entire column in two bytes. Moving on, let's look at the actual text column. What I've actually done is represent the text in two separate columns: one is the UTF-8 byte string for the particular character that is being inserted, and in
the length column we have the number of bytes for that particular Unicode character. If what we're doing is just typing English text, then every single character is going to be within the one-byte range of UTF-8, so this column is just going to consist of lots of ones, since all of the characters have length 1, and that encodes super compactly again. And now that we've encoded the length column for the characters, we can just concatenate the character bytes, because we can later split out where the boundaries are between the characters just by looking at the length column. So we're simply going to concatenate all of the bytes in the UTF-8 column: "helllo" with the triple L, which is encoded in six bytes. So what we've done, just by encoding each column individually, is represent each
column extremely compactly, in just a couple of bytes. And it turns out that because of the editing patterns you tend to get with text, you tend to get these increasing sequences, because people just tend to type letters in sequential order, which means you get increasing sequences of counters that always increase by one each time. So the whole data set has a structure which compresses really nicely, and we can take advantage of our knowledge of this structure to encode it very compactly. This gives us the set of operations; now, in order to have the full change history, we need a little bit of extra metadata: for each set of changes that was made at some point in time, we need the timestamp of when that occurred and the range of operation IDs. With this extra metadata we can now reconstruct exactly what the document looked like at every past point in time, while using only a very small
amount of additional storage space. So if people tell you that with CRDTs you have to put effort into tombstone collection, because tombstones waste a lot of space, just think back to this table here: the tombstones actually cost us just 48 percent of the space, a fairly small cost to maintain, whereas choosing between a simple JSON format and an optimized binary format made a factor-of-200 difference. So I think this means we have the ability, if we want, to store the full change history and to do diffs and interesting visualizations of the change history of our documents, at very low cost, and we don't actually have to throw away all of
this extra metadata that we have. So this brings me to the end. As I said at the beginning, I think CRDTs are very easy to implement badly: they're very easy to implement in a way that is inefficient and that has all sorts of weird behaviors, like the interleaving anomaly that we saw, and there are all these problems of how we do a move operation, both on lists and on trees, in a way that is safe and reasonably intuitive to our users. But as you've seen from this talk, we have made a lot of progress on these topics in just the last few years, and I think it's going in a direction where CRDTs are really starting to become something that can be the foundation for a very large set of applications, so that we can start to build actual, real, practical user software, not just research prototypes, on top of these basic principles. So I hope you will join me in finding this very interesting and in exploring where this ends up going in the future. I will end with some references: first of all, the list of the various text-editing CRDTs that I talked about in the context of interleaving, and finally the publications that I've worked on for the various algorithms, anomalies and problems that we talked about in this talk. All of the slides are online, so you can find them there. Thank you very much for listening, I hope you enjoyed this talk, and I'll see you again soon.