CRDTs: The Hard Parts
By Martin Kleppmann
Summary
Topics Covered
- CRDTs Enable P2P Collaboration Without Central Servers
- Interleaving Destroys Concurrent Text Insertions
- List Moves Require Last-Writer-Wins Positions
- Tree Moves Use Undo-Redo to Prevent Cycles
- Binary Format Cuts CRDT Metadata 200x
Full Transcript
Hello everybody, my name is Martin Kleppmann. I'm a researcher at the University of Cambridge, and today I'd like to talk about some of our work on conflict-free replicated data types, or CRDTs. I'm going to start this talk with a very brief introduction as to what CRDTs are and why they are useful, but most of this talk is going to be about new results that we have figured out only within the last year or two. You might know me from this book here, which I wrote a couple of years ago. It's called Designing Data-Intensive Applications, and it's a broad overview of the architecture of data systems and how to choose your database for your particular application. That's not what we're talking about today. Today we're talking about collaboration software, and I'm taking a fairly broad view of this: it's any kind of software where several users can come in and update a document or database or some kind of shared state. Various examples of software fit in this category. Google Docs is an obvious example, where you can have several people editing a text document or spreadsheet at the same time; you get similar things with Office 365. Figma is an example of graphics software that supports multi-user real-time collaboration. Trello is an example of project management software where several users can update the status of various tasks, assign the tasks, add comments, and so on. All of these have in common that the users are updating some kind of shared state. Now I'm going to use this example of
collaborative text editing for now; there will be some examples from other types of applications later as well. In collaborative text editing you can end up with a scenario something like this: you have two users who, at the beginning of the time we're looking at, have the same document, which is "Hello!". Then the red user comes along and changes this document to "Hello World!" by inserting the word "World" before the exclamation mark, and concurrently, while the red user is doing this, the blue user changes the document to add a smiley face after the exclamation mark. These two changes happen concurrently, without knowledge of each other, and then later on the network communicates those changes: these users communicate over the Internet, or whatever network they are using, and they should converge to the same state. We want that, at the end, both users have the same document on their screen, and that document reflects all of the changes that were made. In this case the merged outcome is quite reasonable: we get the word "World" before the exclamation mark and the smiley face after the
exclamation mark. Okay, so there are broadly two families of algorithms for achieving this kind of collaborative text editing. One is operational transformation, or OT, which is kind of the classic of multi-user collaborative editing; it has been around for a long time and is used, for example, by Google Docs. Then there's a newer family of algorithms called CRDTs, which is what I work on, which take a bit of a different approach to solving a similar problem, and we will compare the two in a minute. I will start with a brief summary of how operational transformation works, because that is then useful as a comparison point for CRDTs. With operational transformation we have to first of all describe how changes are made to a document. When a user changes a document, they insert or delete a character at some position in the document, and we can describe the positions at which those insertions and deletions happen by using indexes: we simply number the first character 0, the second character 1, and so on. Now you can have two users updating their document: here on the left-hand side the user changes the document "HELO" by adding a second L at position 3, and on the right-hand side we have the other user adding an exclamation mark after the letter O. After these two changes have happened, they need to be communicated over the network so that one user finds out about the changes made by the other user. From the left-hand side here we have the insertion of the letter L at index 3 in this document, and from the right-hand side we have the insertion of the exclamation mark at index 4. So let's first of all take the insertion
from the left-hand side and move it over to the right-hand side. If there's a server, for example, that forwards these messages from one user to the other, the server can just forward this "insert L at position 3" operation over to the right-hand side, and on the right-hand side we insert L at position 3 and we end up with "HELLO!", with two Ls and an exclamation mark, just as we wanted. Now what happens if we go in the other direction, taking the exclamation mark insertion from the right-hand side to the left-hand side? Note that if we simply forward this operation, "insert exclamation mark at position 4", unchanged, then the user on the left-hand side would end up with the document saying "HELL!O", which is not what we want. We want both users to end up in the same state, and we want that state to reflect the intention of what the users were writing. So instead, what has to happen is that, instead of inserting at position 4, the exclamation mark needs to be inserted at position 5. The reason is that, concurrently, there has been another insertion at position 3, before the position of this insertion, and so the insertion at position 4 needs to be transformed into an insertion at position 5, due to the concurrent insertion of the L earlier on. This is why the algorithm is called operational transformation: we have to take these insertion and deletion operations and, depending on which other operations happened concurrently, we have to transform the indexes or positions at which those operations take
place. This algorithm does work, but it requires one very particular assumption: the fundamental assumption in most operational transformation systems is that all of the communication goes via a single central server. In Google Docs' case that server is provided by Google, and the server takes a really key role in this algorithm by sequencing all of the operations. The assumption of operational transformation is that, because we have this central server that is required to sequence all of the operations, there cannot be any other communication channels in the system. So even if two users who are collaborating are actually sitting in the same room, and they could just be using the local Wi-Fi or even Bluetooth to communicate their changes between their two devices, using that local communication channel is not allowed, because it would undermine the assumptions made by the operational transformation algorithm. So OT requires all of the communication to go via the central server, even if that server is on a different continent, and even if the two users are actually sitting in the same room. This is where CRDTs come in.
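As a rough illustration of the transformation step just described, here is a minimal sketch in Python. This is my own simplification for illustration, restricted to concurrent insertions only (real OT systems also transform deletions, and Google Docs' actual implementation is far more involved); the tie-break on node IDs is one common convention.

```python
def transform_insert(pos, other_pos, my_id, other_id):
    """Transform the index of an insertion against a concurrent insertion.

    If the concurrent insertion happened at an earlier (or equal) index,
    our own index must shift right by one so it still refers to the
    intended position. On an equal index, node IDs break the tie.
    """
    if other_pos < pos or (other_pos == pos and other_id < my_id):
        return pos + 1
    return pos

# Left user inserts "L" at 3 into "HELO"; right user inserts "!" at 4.
left = transform_insert(3, 4, "left", "right")    # stays at 3
right = transform_insert(4, 3, "right", "left")   # shifted to 5
```

Applying the transformed operations on both sides yields "HELLO!" for both users, which is the convergence property the algorithm is after.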
CRDTs allow this kind of multi-user collaboration without making any assumptions about the network topology. They don't assume anything about servers or about what kind of network is used: any type of communication channel can be used to communicate the operations or updates from one user to another. The main correctness criterion we have, for both OT and CRDTs, is what we call convergence. What we require is that whenever any two users have seen the same set of operations, those users must be in the same state, and this holds even if the users actually saw the operations in a different order. What we do in CRDTs is make the operations commutative, which means that even if their order is swapped around, the end result remains the same, and this is how we achieve convergence. It's just ensuring that after everyone has communicated, everyone has the same document on their screen, and this makes sense: it's of course a very useful property, because if we did not have it we would have some kind of permanent inconsistency between the users, and that would be bad. So convergence is clearly a minimum requirement for any of this sort of collaboration software. But as we shall see in this talk, convergence by itself is not really enough, because convergence doesn't say anything about what that final state is after the nodes have communicated. Is that final merged state actually the state that we wanted? This is a little bit more difficult to define, because what state is desirable or undesirable is really a matter of human judgment: what are our expectations, as humans, of how the software should behave? Unfortunately, it turns out that several of these CRDT algorithms for collaborative editing sometimes have behavior that is not really desirable; they behave weirdly in ways that we as humans would not expect. I have come to the conclusion, after several years of working on CRDTs, that basically these problems are quite hard. A simple version of CRDTs is very easy to implement, but actually getting it right, in such a way that it satisfies user expectations, is difficult. CRDTs are easy to implement badly, and what we shall look at in this talk is how we can implement CRDTs well, in a way that actually behaves as we want. Today I want to look at these four topics. These are all areas in which CRDTs are difficult in some way or another, and for most of them we have solutions as well. I'm going to talk through those one by one, and the
one that I want to start with is interleaving. Interleaving is a problem that can happen in collaborative text editors if several people insert text at the same position in the document. In order to explain why this interleaving happens, I'm going to first give a little introduction as to how CRDTs for text editing work, and I'm going to start with an example. As I said earlier, when we were looking at operational transformation, we identify the position where we want to insert or delete a character by its index, simply by counting from the start of the document, and that has the problem that those indexes need to be transformed. CRDTs avoid the need for this kind of operational transformation by taking a different approach: CRDTs create a unique identifier for every single character in the document, and that unique identifier remains stable; it stays the same even if stuff is added and deleted elsewhere in the document. Each character keeps its own permanent unique identifier. There are several possible ways of constructing such identifiers. One way is to choose a number between 0 and 1, some kind of fractional number, a rational number, and we say, for example, that 0 is the beginning of the document and 1 is the end of the document, and for every character somewhere within the text document we just pick a number that is appropriately positioned within that number line. Here, in our example of "HELO", we could say: okay, let's assign the number 0.2 to the H, the number 0.4 to the E, the number 0.6 to the L, and the number 0.8 to the O, and this way we have spaced out our characters over our 0-to-1 number interval. Now, when we want to insert a second letter L and an exclamation mark, we pick numbers in between the existing ones. We want to insert a second L in between the first L and the O, so in between 0.6 and 0.8: okay, let's pick 0.7 as the midpoint. Likewise, for the exclamation mark we'd pick a number between 0.8 and 1.0, so we pick 0.9, for example.
So this is a simple enough scheme, and we can use it to implement a CRDT in quite a simple way. What we can now do is think of our text document as a set of triples. Each of these triples contains the character at a particular position; it contains the numeric value that tells us where this character goes; and then, finally, the ID that I've put in these triples is just a way of referring to the node that created this particular insertion operation. It's just some kind of unique identifier for a particular user or a particular node in the system, and the reason we have it is just in case two different nodes pick the same number: then we can define the ordering on these characters using the node ID as a tiebreaker. So we now just have this set of triples representing the document, and we can make edits: we can insert new characters into this text document just by picking numbers for them and then calculating the set union, and we get a set including the new triples. We can reconstruct exactly what the state of the document is by sorting those triples by their numbers, tiebreaking based on node IDs, and then we get the text document. Several CRDTs for text take essentially this approach.
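The scheme just described can be sketched in a few lines of Python. This is a toy model for illustration only: real implementations use tree paths or timestamps rather than floating-point numbers, which would run out of precision.

```python
# A document is a set of (position, node_id, character) triples.
# Sorting by (position, node_id) recovers the text; merging two
# replicas is just set union, so merging is commutative.

def insert(doc, pos, node_id, char):
    return doc | {(pos, node_id, char)}

def render(doc):
    return "".join(ch for _, _, ch in sorted(doc))

doc = {(0.2, "a", "H"), (0.4, "a", "E"), (0.6, "a", "L"), (0.8, "a", "O")}
doc = insert(doc, 0.7, "a", "L")   # between the L (0.6) and the O (0.8)
doc = insert(doc, 0.9, "b", "!")   # between the O (0.8) and the end (1.0)
print(render(doc))                 # HELLO!
```

Because the state is a set and merging is set union, any two replicas that have seen the same insertions render the same text, regardless of the order in which the insertions arrived.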
They tend to not actually use numbers between 0 and 1; instead they use things like a path through a tree, but the effective idea is still the same: they are essentially picking positions on a number line. When we implement a CRDT in this way, we can get some unfortunate behavior, and this is an example of behavior that can occur in some text-editing CRDTs. Here we have two users. On the left-hand side, user 1 starts off with the text document saying "Hello!" and inserts the word "Alice" between "Hello" and the exclamation mark. Concurrently, on the right-hand side, user 2 inserts the word "Charlie" between "Hello" and the exclamation mark. Now the two users communicate, they merge their changes, and what we get out at the end is "Hello" followed by some kind of weird jumble of letters that we can't read. What does this even mean? How did this happen? We started with "Alice" and "Charlie" and we ended up with a mess of letters. Well, what happened is this. We started off with a text document where each character has a number on the number line: the O had 0.64 and the exclamation mark had 0.95. The insertion of " Alice" just chose some numbers in the interval between 0.64 and 0.95, and likewise the "Charlie" insertion independently chose some numbers within that interval. What a lot of CRDTs do here is actually randomize the numbers a little bit, because that reduces the probability that two different nodes will pick exactly the same number for a particular character, so it reduces the risk of collisions. What has now happened with this insertion is that we have taken all of the numbers from the "Alice" letters, which are colored in red here, and the insertions of the "Charlie" letters, which are in blue, and we simply ordered them based on the numbers. The result, unfortunately, is that we have interleaved those two character sequences that came from the two different users, and the result is a complete unreadable mess. This problem is bad enough if we're just inserting a single word, as we are here; it gets even worse if the two users have concurrently inserted an entire paragraph, or maybe an entire section, into their text document. If you have a paragraph or a section in which the characters have been interleaved on a basically random basis, you will not be able to use that text any more; you will just have to delete it and rewrite it, and the users are not going to be happy about this.
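To see how the interleaving arises, here is a small Python sketch of the scenario above, building on the set-of-triples representation. The positions here are evenly spaced rather than randomized (a deterministic stand-in for the random choices the talk mentions, so the outcome is reproducible); the interval endpoints are the 0.64 and 0.95 from the example.

```python
# Two users concurrently insert a word into the same gap (0.64, 0.95).
# Each spaces its characters evenly through the gap, as if typing alone.

def evenly_spaced(lo, hi, text, node_id):
    n = len(text)
    return [(lo + (hi - lo) * (k + 1) / (n + 1), node_id, c)
            for k, c in enumerate(text)]

alice = evenly_spaced(0.64, 0.95, " Alice", "u1")
charlie = evenly_spaced(0.64, 0.95, " Charlie", "u2")

# Merging is set union; rendering sorts by (position, node_id).
merged = "".join(c for _, _, c in sorted(alice + charlie))
print(repr(merged))   # '  CAhlarilciee' -- the two words are interleaved
```

Neither user did anything wrong: each chose perfectly valid positions within the gap, but because the two sequences of positions fall into the same interval, sorting shuffles the words together.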
Unfortunately, this behavior really does occur in real CRDTs. We have examined a number of different CRDT algorithms for text editing, and we have demonstrated this problem at least in the algorithms called Logoot and LSEQ. In these algorithms, unfortunately, the interleaving problem is very deeply baked into the fundamental way the algorithm is constructed. I cannot see any way of fixing this problem, of fixing this bug in the algorithms, because the way these numbers are assigned, or the way the paths through the trees are assigned, means the interleaving is inherent in the way these algorithms work. In other algorithms we have not found this problem: in Treedoc and WOOT, for example, we did not find this interleaving problem. That does not entirely guarantee that it can never happen, but we think it won't happen, and so in a way these algorithms are safe. However, unfortunately, these two algorithms are also the ones that are least efficient. The entire reason why the LSEQ algorithm was designed, for example, is that the designers wanted a CRDT that was more efficient than Treedoc and WOOT, because Treedoc and WOOT unfortunately take a lot of space for their identifiers. So, unfortunately, this performance optimization has introduced bad behavior in the algorithm, and in my opinion that makes algorithms like LSEQ and Logoot basically unsuitable for use in many text-editing applications. Moreover, there is a specification of collaborative text editing called Astrong, which was published by Attiya and others in 2016, and this specification also allows interleaving as a valid behavior, though we have a paper in which we fix this specification
to rule out this interleaving behavior. Finally, RGA is an interesting one. RGA is another text-editing CRDT algorithm, which does not quite have this same interleaving problem; it has a bit less of an interleaving problem, and I will mention it briefly because it's an interesting case study. In RGA, the interleaving problem
that can happen is like this. In this example here, on the left-hand side, we've got user 1, who starts off with the document saying "Hello!". The user inserts the word "reader" between "Hello" and the exclamation mark, then moves their cursor back to just after the word "Hello" and types the word "dear". So what we have here is typing the word "reader", moving the cursor back, and then typing the word "dear". On the right-hand side we have user 2, who just typed the word "Alice" between "Hello" and the exclamation mark. What can now happen in RGA is that the word "Alice" gets interleaved in between the word "dear" and the word "reader". So even though the text "dear reader" was typed by one user and the word "Alice" was typed by the other user, we can end up with the two mixed up a bit. We don't get the same character-by-character interleaving as with some of the other CRDTs, but we do still have the possibility of text being inserted into the middle of a different user's insertion. The reason this happens is that the data structure RGA uses looks a bit like this: it's a kind of tree data structure, usually with very deep subtrees and a small branching factor. Each node in this tree is a character that was inserted, and the parent of each node is the character that was its immediate predecessor at the time of insertion. We start off with a single special node at the root, which is the beginning of the document, and then when the user types h-e-l-l-o and an exclamation mark, each of those letters has as its predecessor the character that was just typed, and so you end up with this chain of h-e-l-l-o, almost like a linked list, where each character's parent is just its immediate predecessor in the tree. Then the users come and position their cursors in between "Hello" and the exclamation mark: one types " reader", another types " Alice", and the first user, who typed "reader", moves their cursor back and types "dear". You can see that every time the cursor gets moved, that results in a new subtree being formed here. Otherwise, the order of the characters in the document is a depth-first traversal over this tree, and we can see that in the interleaving that has happened here: "Hello dear Alice reader". What has happened is we visit "Hello", then first the "dear" subtree, then the "Alice" subtree, then the "reader" subtree, and then the exclamation mark. The order in which we visit those subtrees is defined by a timestamp that is attached to every node in the tree: if, as in this case, one node has several children, then we visit those children in descending order of timestamp. Because the timestamps are unpredictable in concurrent editing cases, you can end up with these three possible outcomes as the merge result.
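The depth-first, timestamp-ordered traversal just described can be sketched as follows. This is a simplified model of RGA's tree (integer timestamps chosen by hand for the example; real implementations use Lamport timestamps and handle deletions, which are omitted here).

```python
def render(children):
    """Depth-first traversal, visiting siblings in descending timestamp order."""
    text = ""
    for ts, ch, kids in sorted(children, key=lambda n: n[0], reverse=True):
        text += ch
        text += render(kids)
    return text

def chain(ts0, text, last_children=()):
    """Build a linked-list-like chain of (timestamp, char, children) nodes:
    each character is the single child of its predecessor; the final
    character gets last_children attached."""
    children = list(last_children)
    for offset, ch in reversed(list(enumerate(text))):
        children = [(ts0 + offset, ch, children)]
    return children

# Three subtrees hang off the 'o' of "Hello": the user typed " reader"
# first, then moved the cursor back and typed " dear"; concurrently the
# other user typed " Alice". Timestamps: "!" earliest, then reader,
# Alice, dear, so the traversal visits dear, Alice, reader, "!".
o_kids = (chain(10, " reader") + chain(20, " Alice") + chain(30, " dear")
          + chain(7, "!"))
doc = chain(2, "Hello", last_children=o_kids)
print(render(doc))   # Hello dear Alice reader!
```

With different relative timestamps among the three sibling subtrees you get the other possible merge outcomes, which is exactly the unpredictability described above.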
So it's not as bad as the character-by-character interleaving that you get with the others. I should be precise: it is possible to get character-by-character interleaving in RGA. The circumstance in which that happens is if the users type their document back to front: if the users start at the end of the document and first type the last letter, then move their cursor back to the beginning and type the second-to-last letter, move the cursor, type the third-to-last letter, and so on, typing the entire document from the end to the beginning. If they do that, then they would actually get character-by-character interleaving in RGA, because in that case, in this tree, all of the characters would be siblings: they would all be children of the root node, and in that case the ordering is just defined by timestamps again. But fortunately, in practice, people tend not to type documents like that: people tend to type documents in a reasonably linear fashion from beginning to end. Of course, occasionally people will backtrack a bit, delete a bit, copy and paste some sections, and so on, but as a general rule, typing still happens in a fashion from beginning to end, and for that reason the interleaving problem is not actually as bad in RGA as it is in the other algorithms. However, we could still say we want to be strict and not allow any interleaving at all, and we did actually devise an algorithm: for RGA we were able to find a fix, a modification to the algorithm, which changes it such that this kind of interleaving of concurrent insertions does not happen, regardless of how the cursors were moved. I don't have time in this talk to go into that particular algorithm, but if you're interested there's a paper we published in 2019 called "Interleaving anomalies in collaborative text editors", and in this paper the algorithm is described exactly. So let's
move on to our next topic. Our next topic today is reordering list items: we have a list of things, and we want to move an item from one position in the list to another. One example application in which we might want this is a to-do list. In a to-do list you can often drag and drop items into whatever order you want, and so what happens here, for example, is you have a to-do list consisting of, firstly, "buy milk", secondly, "water the plants", thirdly, "phone Joe", and the user is dragging and dropping the "phone Joe" item and moving it to the top of the list, so that it is now at position 1. How do we do this? Well, we can represent the list using any of the CRDTs we talked about earlier: all of the CRDTs for text editing are also CRDTs for lists, so we can use those to represent this ordered list of items. However, none of those CRDTs actually supports a move operation: all they have is operations for inserting and deleting items, but no operation for moving. Now, you might think, okay, that's not too bad, because we can always implement moving by just deleting the item from its old location and inserting it at its new location, and this works, except that it has a problem if several people concurrently move the same item. In this example here, replica A and replica B both move "phone Joe" to the top of the list. What happens if we implement this move by deleting and reinserting? Well, they both delete "phone Joe" from position 3, so it's definitely gone from position 3, and then they both insert "phone Joe" at position 1. But the result now is that these two insertions at position 1 both happen, and so we end up with two copies of "phone Joe", at positions 1 and 2. This item that we only wanted to move has accidentally got duplicated in the process, and this is really not the kind of behavior that a user would expect, I think. In a to-do list application, just because a user reordered some items on their list doesn't mean they suddenly want two copies of those items appearing.
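The duplication problem is easy to reproduce with the toy set-of-triples list from earlier. This is an illustrative sketch, not any particular library's API; note also that a real list CRDT would record deletions with tombstones, which I've glossed over since both replicas happen to delete the same element here.

```python
# List elements are (position, node_id, value) triples; the list state of
# a replica is a set, and merging two replicas is set union.

def move_by_delete_insert(state, elem, new_pos, node_id):
    """'Move' elem by removing it and inserting a fresh copy at new_pos."""
    return (state - {elem}) | {(new_pos, node_id, elem[2])}

shared = {(0.2, "x", "buy milk"), (0.4, "x", "water plants"),
          (0.6, "x", "phone joe")}
joe = (0.6, "x", "phone joe")

# Replicas A and B both concurrently move "phone joe" to the head.
a = move_by_delete_insert(shared, joe, 0.1, "a")
b = move_by_delete_insert(shared, joe, 0.15, "b")

merged = a | b   # the deletions agree, but both insertions survive
print([v for _, _, v in sorted(merged)])
# ['phone joe', 'phone joe', 'buy milk', 'water plants']
```

The two freshly inserted copies have different identifiers, so from the CRDT's point of view they are simply two independent insertions, and both survive the merge.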
So what is the behavior we actually want? What is the desirable behavior? Let's look at a similar example. In this case we've also got two replicas moving "phone Joe", but they are moving the item to different positions: replica A moves "phone Joe" to the top of the list, whereas replica B moves "phone Joe" to the second position, after "buy milk". What is the desirable behavior in this case? Well, we could say that "phone Joe" should now appear in both positions, before "buy milk" and after "buy milk", but then we've got duplication again, and we did not want duplication. So really what we want is for "phone Joe" to appear in one of the two positions. Maybe we show a warning saying "hey, user, we picked one of the two positions", or maybe it's just fine to pick one of the two and forget that there were two different list positions. So I'm just going to say: we're going to pick one of the two concurrent positions, arbitrarily but deterministically, as the winner. In this example here, we've got "phone Joe" at the top of the list as the winner, so from the point of view of the first user nothing changes, but from the point of view of the second user, after they've moved "phone Joe" to position 2, the "phone Joe" is then subsequently moved to position 1. Now let's think about what is happening here. We've got two concurrently written values for the two different positions, and we want to pick one of them, arbitrarily but deterministically, as the result, as the winner, as the outcome. This may look familiar to you if you have studied existing CRDTs, because it's just what happens in a last-writer-wins register. In a last-writer-wins register we've got a register, which is effectively a variable, which can concurrently have a value assigned to it by different users, and if we have those concurrent assignments, the end result after everybody has merged their state is that we pick one of the written values as the winner, and we throw away any other values that were written concurrently. So you can think of the position of the list item "phone Joe" as having last-writer-wins register semantics: first one user assigns the position of "phone Joe" to be the head of the list, the other user assigns the position of "phone Joe" to be after "buy milk", and then, as the two users communicate, the end result is that the winner is one of those two values, either head of the list or after "buy milk". So let's make this a
little bit more formal. If we want to implement this, we need, first, a last-writer-wins register. Second, we need some kind of unambiguous way of referencing positions in the list. But if you think about what we did earlier, when we were talking about unique identifiers for the characters in a text document, what these list CRDTs and text CRDTs are already doing is providing us with a stable and unique way of referencing particular positions in a list. So we can just use one of these existing algorithms and use its unique identifiers for list elements in order to reference the position of a particular item in the list. Different CRDTs have different ways of constructing those identifiers, as we saw: Treedoc uses a path through a binary tree, Logoot uses a path through a tree with a higher branching factor, and RGA and Causal Trees use some kind of timestamps. But the basic idea is still the same: we've got unique identifiers for particular positions in the list, and so we can take those identifiers and put them as the value inside our last-writer-wins register. Secondly, since we've got several list items, we need a separate register for each list item to hold the position of that particular item. So we can construct a set, for example using an add-wins set CRDT, and the contents of that set are going to be all of the items on our list. Each item on the list is going to be a pair consisting of a value, which is, say, the text of the to-do list item, and a last-writer-wins register containing the position of that particular list item. Now we can add new items to the list by adding them to the add-wins set, and we can move an item from one position to another by updating the value of its last-writer-wins register. In particular, we use a list CRDT to create a position identifier for the location where we want to move a particular item, and then we take that position identifier and use it as the value in the last-writer-wins register. So here we've constructed a new CRDT operation, a move operation for lists, just by using three existing CRDTs: any of this range of different list CRDTs, plus an add-wins set, plus a last-writer-wins register. And because we've just composed existing CRDTs, we know that the end result is also going to be a CRDT. This is very nice, and it allows us to do atomic moves of a single list item at a time.
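Putting the pieces together, here is a toy sketch of the composed design. This is my own illustrative code, not a real CRDT library: positions are the fractional numbers from earlier, the add-wins set is simplified to a plain dict keyed by item, and the last-writer-wins register is resolved by comparing a (timestamp, node_id) pair.

```python
# Each to-do item maps to a LWW register holding its position:
#   items[item] = (timestamp, node_id, position)
# A move just writes a new position; on merge, the higher
# (timestamp, node_id) pair wins, so concurrent moves never duplicate.

def move(items, item, new_pos, ts, node):
    updated = dict(items)
    updated[item] = (ts, node, new_pos)
    return updated

def merge(x, y):
    return {k: max(filter(None, (x.get(k), y.get(k))))
            for k in x.keys() | y.keys()}

shared = {"buy milk": (1, "x", 0.2),
          "water plants": (1, "x", 0.4),
          "phone joe": (1, "x", 0.6)}

a = move(shared, "phone joe", 0.1, ts=2, node="a")   # to the head
b = move(shared, "phone joe", 0.3, ts=2, node="b")   # after buy milk

m = merge(a, b)
order = sorted(m, key=lambda k: m[k][2])
print(order)   # ['buy milk', 'phone joe', 'water plants']
```

Here the timestamps tie, so the node-ID comparison makes replica B's write win deterministically: "phone Joe" ends up after "buy milk" on both replicas, with exactly one copy.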
Now, things get a little bit more difficult if we want to move more than one item at a time. In a to-do list you really only need to move one item at a time, because drag-and-drop typically lets you pick up one item at a time in the user interface, so that's not really a problem. But in text editing you could imagine other circumstances arising. In this example here we've got a to-do list, this time represented just as plain text, as a list of bullet points, where each item starts with a bullet point character and ends with a newline character. Here, replica B is taking the item "milk" and wants to move it in front of "bacon", and that means taking all of the characters (the bullet point, the space, m, i, l, k, and the newline) and moving that whole range to be in front of "bacon". Concurrently with that happening, a different user is updating the "milk" item to say "soy milk", and they do that by deleting the uppercase M and inserting the text "soy" and a lowercase m. Okay, so what do we expect to happen here? Firstly, the change from "milk" to "soy milk" takes effect, and the move of "milk" to be in front of "bacon" takes effect, so the desired outcome is that we have "soy milk" first in the list, in front of "bacon". That means the two changes have been merged together cleanly. This is what we want to happen; unfortunately, it's not what actually happens. If we look at the algorithm that we just constructed for moving a single list item, and we just move each character individually according to this algorithm, the end result we get is this: "milk" is correctly moved in front of "bacon", which happens exactly as we wanted, but the change from "milk" to "soy milk" remains attached to the old position where "milk" was previously, not the new position where "milk" has been moved to. So the end result is that the list reads "milk", newline, "bacon", newline, "soy m", and that "soy m" is just standing there without any context, because its context has been moved away in the meantime. This is a problem that I do not know how to solve for the time being. I've spent a while thinking about it and come up with a couple of half-solutions that don't really work; a couple of other people are thinking about it as well, so if you're interested in this, feel free to think about it, and do publish the result if you manage to figure it out. So this is the problem of moving items in lists: we've figured out how to move a single list item at a time, but moving ranges of characters in a way that behaves nicely is still an open challenge. Let's move on to point three, which is
again about moving, but now rather than moving elements in lists, we're going to move subtrees in a tree. Let's again look at an example. Say we have a tree structure, which might be, say, a file system, and like in the case of moving elements in lists, we can think about what happens if the same node in a tree is concurrently moved by two different users to two different locations. You can see here we start off with the same tree on both replicas, where A, B and C are children of the root. On replica 1 we take node A and move it to be a child of B, whereas on replica 2 we take the same node A and move it to be a child of C. So now A has been moved to two different locations on two different
replicas. In this case, what do we expect the outcome to be? It's kind of similar to the case of moving elements in lists. One option is to duplicate it, so that we have A appearing as a child of B and also A appearing as a child of C. This means that any children of A will also have to be duplicated, and that's what you get here in the box labeled A: you have the A subtree with children A1, A2, A3, and then another whole copy of that subtree. So this is duplication, and like in the case of the lists, I think duplicating nodes on concurrent moves is not a good behavior, so we don't want that. Another option is for A to actually be shared as a common child of both B and C. This works, but it's no longer a tree. If what we expect our data structure to be is a tree, rather than a DAG or some other kind of graph, then this is not an acceptable outcome, because here A has two parents, and in a tree no node ever has two parents. So
that leaves C and D as the possible outcomes. C is: we simply take the replica 1 operation as our outcome and ignore the move made by replica 2. Or, vice versa, we can choose replica 2 to be the winner and ignore the move made by replica 1. In this case, either A appears as a child of B, or A appears as a child of C, but not both. So, like in the case of moving items in lists, what we have is a kind of last-writer-wins behavior here, and I think that is reasonable. Now, in trees, additional complications appear, and so
trees are definitely a harder case than lists. Another example of tricky things that can happen in trees can be illustrated with your file system. If you want, you can try this on your computer: just create a directory called "a", then create a subdirectory within "a" called "b", and then try moving "a" into "b", that is, move "a" to be a child of "a/b". This may sound weird; what you are doing here is effectively trying to
move a directory into itself. I don't know if you've ever tried this; I did try it on macOS, and I get this response here from the shell saying "invalid argument". I'm not allowed to do this move, and this makes sense, because if we did perform a move where we move a directory into itself, we would be creating a cycle, and then our data structure would no longer be a tree: it would be some kind of graph with cycles in it, and this would be very confusing for a file system.
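If you want to try this yourself, here is a quick sketch (the directory names are arbitrary, and the exact error text varies by operating system):

```shell
mkdir -p a/b
mv a a/b   # attempt to move a directory into its own subdirectory
# the OS refuses: macOS reports "Invalid argument", while GNU
# coreutils says "cannot move 'a' to a subdirectory of itself"
```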
So if we build CRDTs for trees, it makes sense that if we allow moving subtrees, we also prevent this kind of cycle and prevent somebody moving an object inside itself. But unfortunately, with CRDTs we have the additional complication of concurrent changes. So consider this example, where we start off with the same tree again on
both replicas. On replica 1 we take the node B and move it to be a child of A, so here B appears as a child of A, while C remains as an existing child of A. On replica 2 we start off with the same tree, and in this case we move A to be a child of B, so we end up with A as a child of B, and C as a child of A as before. What happens now if we merge these two? We've got two move operations, each of which by itself is perfectly fine: there's nothing wrong with moving B to be a child of A, and there's nothing wrong with moving A to be a child of B. But if we combine both of these, we end up with exactly the same problem as we just had: effectively, one directory gets moved inside itself. And so, if we are not careful with this algorithm, we end up with A and B in this kind of cycle, disconnected from the root of the tree, and this is again no longer a tree, it's some kind of more general graph data structure. So we don't
want this if we want to preserve the fact that our data structure is a tree. Another option, as before, is that we can duplicate nodes: we can say that we move B to be a child of A, and we move A to be a child of B, so we have both, B as a child of A and A as a child of B, simply existing in our tree, and again we would have to duplicate the children of those nodes. As before, I think duplication is the wrong way of doing this. So it seems that what we really want is last-writer-wins semantics here, where either we choose the outcome of moving B into A and ignore the other move, or we choose the outcome of moving A into B and ignore the other move. In this case, we want to pick one of the two conflicting operations and just ignore the other one, but we have to ensure that all of the replicas end up picking the same operation as the winner and ignoring the same other operation, because otherwise they won't converge, and then it's not a CRDT. So how do we actually achieve this, where of these
conflicting operations we want to ensure that exactly one gets picked? Well, I don't know, we can try some existing software. I tried this with Google Drive, for example, and I got this wonderful error message. I tried exactly this: having two directories A and B, move A into B and move B into A, concurrently on two different computers. What happens? I get some kind of "internal unknown error, try again", but this error never goes away; it just keeps sitting there in a loop, trying to sync. So Google Drive
hasn't solved this problem either. Okay, I guess we'll have to think about it for ourselves. The way that we solve this is by thinking about the sequence of operations that get applied on each replica, and having a timestamp on each operation. We're just going to assume some globally unique timestamps, as usual; you can use Lamport timestamps, for example, and that works fine. Of these operations, some are going to be move operations and some might be other kinds of operations.
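As a reminder of how such globally unique, totally ordered timestamps can be built, here is a minimal Lamport timestamp sketch (the field and function names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Lamport:
    counter: int
    actor: str  # globally unique actor ID; breaks ties between counters

def receive(local: Lamport, remote: Lamport) -> Lamport:
    # on receiving an operation, advance the local counter past the
    # highest counter seen, so causally later operations sort later
    return Lamport(max(local.counter, remote.counter) + 1, local.actor)
```

Because `order=True` compares the `(counter, actor)` pairs lexicographically, any two distinct timestamps are strictly ordered, which is all the move algorithm below needs.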
Here we have, on replica 1, moving A into B, and on replica 2, moving B into A, and as I said before, each of these operations by itself is safe and fine to execute. What we need to ensure is that we detect the case when two operations are in conflict with each other, in the sense that executing both of the operations would create a cycle; we have to ensure that no matter what we do, we do not create a cycle. Now, what can happen when these two replicas merge their sequences of operations? Well, we take the union of the operations from both of the replicas, and we can put them in timestamp order. And now, if we say that the operations get executed exactly in increasing timestamp order, then the first move is fine, because by that point nothing bad has happened yet, but by the time we get to the second move operation, that move is unsafe. So if we want to ensure that there are no cycles in our tree, we have to skip that operation and just pretend it didn't happen. This is a bit tricky, because replica 2 has already executed this operation, so replica 2
will somehow have to undo this operation that it has already performed, in order to ensure that it converges again with replica 1. So the basic principle we're going to use is to take these sequences of operations, where we have a timestamp on each operation, and merge them such that we get a single sequence of operations in timestamp order. For example, here on replica 1 we have timestamps 1, 3, 4 and 5, and on replica 2 we have timestamps 2, 6, 7 and 8. Imagine you're replica 1, and you've executed 1, 3, 4, 5, and now you receive the operation with timestamp 2 from replica 2. What you need to do is take this timestamp-2 operation and insert it somewhere earlier into the sequence of operations that you've executed. But time has already moved on by this point, and we've already executed the operations with timestamps 3, 4 and 5. So how do we handle this? Well, we can just take a
fairly simple approach, which is to undo and redo operations, and this is exactly what we do in our algorithm. Say you are replica 1, and you want to apply the operation with timestamp 2: you've got your log of operations that you've performed, with timestamps 1, 3, 4 and 5. We're first going to undo operations until we've undone all of the operations with timestamps greater than the new operation: we undo op 5, then op 4, then op 3. Now we're only left with op 1, so we can apply op 2 on top of that, and then we redo the three operations that we had undone. The end result is that we have a log of operations where op 2 has been inserted in the right place. And if we keep performing this kind of logic on each of the replicas, then this will ensure that all of the replicas end up with the same operations executed in the same order.
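The undo/redo merging just described can be sketched generically like this. This is a simplification for illustration, not the paper's exact formulation: `do` applies an operation to the state and returns an undo record, and `undo` reverses one:

```python
def integrate(log, state, new_op, do, undo):
    """Keep `state` as if all operations had been applied in increasing
    timestamp order. `log` holds (timestamp, op, undo_record) tuples in
    timestamp order; `new_op` is a (timestamp, op) pair."""
    undone = []
    # undo every operation whose timestamp is greater than the new op's
    while log and log[-1][0] > new_op[0]:
        ts, op, record = log.pop()
        undo(state, record)
        undone.append((ts, op))
    # apply the new operation in its correct place in the order...
    log.append((new_op[0], new_op[1], do(state, new_op[1])))
    # ...then redo the operations we rolled back
    while undone:
        ts, op = undone.pop()
        log.append((ts, op, do(state, op)))
```

Every replica that runs `integrate` over the same set of timestamped operations ends up with the same log order, and hence the same state, regardless of the order in which the operations arrived.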
This just means we have a bit of extra work to do, because if some operation needs to be inserted somewhere quite early in the order, we have to do a lot of undos for that one operation, and then a lot of redos to replay all of the operations that happened afterwards. So you might wonder: isn't the
performance going to be terrible? Well, we did some performance experiments, and it was okay. There was certainly a cost to performing all of these undos and redos, and you can see that here: we set up a system with three replicas on three different continents, so there was a delay of over a hundred milliseconds round-trip time between these replicas, and we then started generating move operations at a high rate. The higher the rate of move operations in this system, the more undo and redo needs to be done, and so the slower it becomes to execute each individual operation. We were able to get this system to at least 600 operations per second, which is not massive on big-data scales, but on the other hand, a single user is not going to perform more than 600 move operations per second on their own local file system. So for a lot of collaboration software, this performance is perfectly fine, and we don't have to worry too much. So I will explain a little bit more about how
this algorithm works in terms of the data structures that we use. We first have to find a way of describing the move operation that we want to perform. We're going to say a move operation is a structure with the following fields. First of all, like I said earlier, a move operation has a globally unique timestamp, such as a Lamport timestamp. Then a move operation has got to have a child, which is the node in the tree that we are moving, and a parent, which is the new location, the new parent of that child in the tree. Note that we are not going to record in this operation what the old parent of the child was, so we can call this a "stealing" move: we're simply going to take the child from wherever it currently is in the tree and move it to be a child of this new parent. And further, we can have a bit of metadata associated with the operation: for example, in a file system you have a file within a directory, and the file has a name within that directory, and that name can be the metadata that is associated
with a particular file in a particular directory. Then we have to construct the log of operations, in order to perform the undo and redo, and in order to perform the undo we need to associate a little bit of extra data with each entry in the log. Each entry has several fields that come straight from the move operation: the timestamp, the new parent, the new metadata, and the child all come directly from the move operation. In addition to these, we now also record the old parent and the old metadata, that is, what the parent of the child was before we performed this move, and what its old file name was, for example, if we're using the metadata for file names. This now allows us to perform the undo, because to undo the operation we just set the child's parent and metadata back to its old parent and its old metadata. And given this log of
move-operation log entries, we can now construct the current state of the tree. The tree is just a set of (parent, metadata, child) triples, one triple for each parent-child relationship in the tree. We can now define the ancestor relation: we say that a is an ancestor of b if either a is a direct parent of b, that is, a triple (a, m, b) exists in the tree, or there exists some tree node c such that a is a parent of c and c is an ancestor of b. So this just defines the transitive closure of the parent-child relationship.
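Under that triple representation, the ancestor relation can be sketched as a plain iterative version of the recursive definition (the function name is illustrative):

```python
def is_ancestor(tree, a, b):
    """tree is a set of (parent, meta, child) triples. Return True if
    `a` is an ancestor of `b`, i.e. `a` is reachable from `b` by
    repeatedly following child -> parent links. This terminates
    because a well-formed tree contains no cycles."""
    frontier = {b}
    while frontier:
        # step one level up: the parents of every node in the frontier
        frontier = {p for (p, _, c) in tree if c in frontier}
        if a in frontier:
            return True
    return False
```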
And given this definition of ancestry, we can now define what it means for a move operation to be safe. For a move operation that we want to perform, we look at the child that is being moved and the parent that is the destination the child is being moved to. If the child is currently already an ancestor of that destination, then the move would be unsafe, as we would create a cycle, so we do nothing. Similarly, if the child and the parent are in fact the same node, that would also be unsafe, so again we do nothing. But in all other cases it's fine, so we can update the tree by removing the existing location of the child, that is, removing any (p, m, child) triple for any p and m (this is exactly what I said: we steal the child node from wherever it currently is in the tree), and then adding the new (parent, metadata, child) triple to the tree. This defines our updated state of the tree.
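Putting the pieces together, here is a hedged sketch of the move operation and its undo. For compactness it represents the tree as a map from child to (parent, metadata), which is equivalent to the triple set because every node has at most one parent; the names are illustrative, not the paper's notation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MoveOp:
    time: int    # globally unique timestamp (an int here for simplicity)
    parent: str  # new parent for `child`
    meta: str    # e.g. the file name under the new parent
    child: str   # node being moved, "stolen" from wherever it currently is

def _reaches(tree, a, b):
    # is `a` an ancestor of `b`? walk up from b via parent links
    node = b
    while node in tree:
        node = tree[node][0]
        if node == a:
            return True
    return False

def apply_move(tree, op):
    """Apply op if safe; return a log entry (op, old, skipped), where
    `old` is the child's previous (parent, meta), so the move can be
    undone later during undo/redo reordering."""
    # unsafe: moving a node under itself or under one of its descendants
    unsafe = op.child == op.parent or _reaches(tree, op.child, op.parent)
    old = tree.get(op.child)
    if not unsafe:
        tree[op.child] = (op.parent, op.meta)
    return (op, old, unsafe)

def undo_move(tree, entry):
    op, old, unsafe = entry
    if unsafe:
        return  # a skipped move changed nothing, so there is nothing to undo
    if old is None:
        del tree[op.child]
    else:
        tree[op.child] = old
```

Running the concurrent example from above, moving B under A and then A under B, the second move is detected as unsafe and skipped, so the structure remains a tree.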
Given this function for performing a move operation, and given the undo/redo procedure that I explained previously, we now have an algorithm for safely performing move operations on a tree CRDT, and we proved several theorems about this algorithm. In particular, we showed that this algorithm does indeed preserve the tree structure. To preserve the tree structure, what we need is that every node has a unique parent and that the tree contains no cycles; that is the definition of a tree, and we proved that these things hold, so for any sequence of operations that is executed, the data structure remains a tree. And then, obviously, we have to prove that it's a CRDT, which means that it converges: given any two sequences of operations on this tree, if one sequence of operations is a permutation of the other, then applying those sequences leads to the same state; in other words, the operations are commutative. So this algorithm works, and we have implemented it and proved these properties about performing move operations on trees. Okay, moving on from moving subtrees in CRDTs, let's talk about efficiency, and making CRDTs more
efficient. One particular problem that occurs with a lot of CRDT algorithms, especially the text-editing CRDTs, is the overhead that the CRDT metadata incurs. If you think about a text document, what we said is that we give every single character in the document a unique identifier. So what you end up with is that each character of English text is one byte for the ASCII character, the UTF-8 encoding of that character, but then you have additional metadata: you might have, for example, some path through a tree, which might take a couple of dozen bytes to encode, and then some identifier for the node that inserted a particular character, which, if it is a UUID, will be at least another 16 bytes, or if you encode it in hexadecimal, more like 36 bytes. So very quickly you end up with this really disproportionate situation where you have one byte for the actual data that you want, and something like a hundred bytes of metadata, or even more than that, just for the CRDT metadata in order to make the CRDT work, and that's not a particularly great situation to be in. So what I would like to talk about now is some work that we've been doing as
part of the Automerge project, which is a CRDT implementation that I work on, in order to bring down that metadata overhead, and we've had some really successful results by using some carefully designed algorithms and data structures. Let me start with the results. The old version of Automerge uses a JSON encoding to encode all of the metadata, send it over the network, and write it to disk, and this JSON encoding is incredibly verbose. What we've done here, as an example, is take a particular case study, which is the editing history of the LaTeX source of a paper.
In order to generate this editing history, a colleague and I actually wrote this paper using a homegrown text editor, and so we were able to capture every single keystroke that went into the writing of this paper. The final file is about 100 kilobytes of plain text, just ASCII text containing LaTeX markup, and we captured over 300,000 operations, that is, 300,000 individual keystrokes of editing history going into it. That includes all of the individual characters that were typed in order to create this document, all of the characters that were deleted, any stuff that was added or removed, any typos that were corrected, and so on. We also recorded all of the cursor movements, so every single time the cursor was moved from one position in the document to a different one, that was
also recorded. And if you take this history of 300,000 or so changes and encode it in the current Automerge data format, this JSON data format, the file size is about 150 megabytes; that is, we're using almost 500 bytes per change, which is pretty terrible. And you can see that with the new binary data format that we've designed to bring down that metadata overhead, we've made some really big improvements. What we have done is, first of all, design a binary format that encodes the full editing history, including every single insertion and deletion, and including when it happened, and encodes that in a compact form. It doesn't lose any information at all; this is a completely lossless encoding. And what we've managed to do is reduce the file size by over a factor of 200, down to about 700 kilobytes, containing exactly the same information, just encoded differently. I should emphasize that if you just took the JSON history and encoded it using one of the binary equivalents of JSON, you would not get anywhere near this: if you use Protocol Buffers or MessagePack or something like that, you might reduce it by a factor of two or three at the most. But in order to get this factor-of-200 improvement, you have to design a data format that is specific to CRDTs and specific to the particular editing
patterns that tend to occur in these sorts of applications. Now, you can see that by gzipping these files you can reduce them: you can reduce the 150 megabytes down to about 6 megabytes, but those 6 megabytes are still huge compared with the 700 kilobytes that we achieved, uncompressed, in our binary format; and that still gzips down further, to about 300 kilobytes, without losing any information at all. So note that by this point we've been able to encode a change history consisting of over 300,000 changes in 300 kilobytes, so we're now actually using less than one byte per change to encode the full change history. We can look back at any past version of this document, we can diff any two versions at any points in time, and all of that data is still there, encoded in this very compact form, which I find very exciting. Now, of course, we can still compress it further by throwing away some of the history. One of the first things we might choose to throw away is the cursor movements, because, to be honest, who cares about exactly where the cursor was at which point in the editing history? So we can save about 22% by throwing away the cursor movements, and we can go down to about 230 kilobytes by also throwing away all of the change history for the text. By this point, where we're at 230
kilobytes, we still have all of the CRDT metadata, including all of the tombstones. This means that at this point we're still able to perform arbitrary merges between different replicas of this data, and yet we are down to only about twice the raw data set size: about 100 kilobytes is the raw text document with no CRDT metadata at all, just the final state of the text document, and all of the CRDT metadata is adding only a bit more than 100 percent overhead to this, without gzip. If we actually apply gzip, then the file sizes are almost the same: the file containing the CRDT metadata, gzipped, is about the same size as the plain text file, non-gzipped. Now, we can get even further down by removing the tombstones. The tombstones are the markers for deleted characters in the document, and if we remove them, we save, what is it, another 70 kilobytes or so. Whether you want to do that or not depends, because if you throw away the tombstones, it means you can no longer merge with edits that happened concurrently: any edits that happened before the tombstone removal, or concurrently with the tombstone removal, can't be merged any more. But if you do remove the tombstones, then the raw CRDT metadata, just the unique identifiers for the characters, takes only 50 kilobytes, so it's only about 48 percent overhead on top of the raw text document, using our data format. Very exciting, I find this. So I should tell you a bit about how this actually works, and how we were able to achieve
such good compression. The basic idea is, as I said, that we want to maintain the full change history as much as possible, so we're going to store the full set of all of the insertion operations, all of the deletion operations, and all of the cursor movements if necessary. Each operation is given a unique ID, which is just a Lamport timestamp, and a Lamport timestamp is just a pair of a counter and the ID of the node that generated it, which we call the actor ID in Automerge. The algorithm that we use for text editing is based on a CRDT where, whenever you perform an insertion, you reference the identifier, that is, the Lamport timestamp, of the character that is immediately before the inserted position. We keep this set of all of the operations stored in document order, that is, in the order in which
the characters appear in the document, and what we get is this kind of table here. You can think of this almost like a table in a relational database, where each row is one operation. What you see here is someone typing h-e-l-l-l-o, so it's "hello" with three L's, and then somebody came along and deleted one of the three L's, so that the final result is just "hello". The way this is encoded is that each row here is an insertion operation into this document, and each row has an operation ID, which is a Lamport timestamp, so it consists of a counter that just increments, and the actor ID. I just wrote "a" here, but in reality our actor IDs are actually UUIDs, so they're more like 16 bytes or even 32 bytes long. So we have this unique identifier for each character that was inserted. Then we have the reference element, which is the ID of the predecessor character; this is just saying that the "e" comes after the "h": the "h" has ID 1a, and the "e" has ID 2a and comes after 1a.
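As plain data, the "helllo" table just described might look like this (the column layout and the deleting actor's ID are illustrative, not Automerge's actual schema):

```python
# Each row: (op_id, ref, deleted_by, char). op_id and ref are
# (counter, actor) Lamport timestamps; ref identifies the character
# immediately before the insertion position (None for the start).
ops = [
    ((1, "a"), None,     None,     "h"),
    ((2, "a"), (1, "a"), None,     "e"),
    ((3, "a"), (2, "a"), None,     "l"),
    ((4, "a"), (3, "a"), (7, "b"), "l"),  # deleted later by actor b
    ((5, "a"), (4, "a"), None,     "l"),
    ((6, "a"), (5, "a"), None,     "o"),
]

# the visible text is just the undeleted characters in document order
text = "".join(ch for (_, _, deleted_by, ch) in ops if deleted_by is None)
```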
We can encode all of the operations this way. Furthermore, if an operation is marked as deleted, we do that by having these additional deletion columns, where we record the operation ID of the operation that deleted this particular character; in fact, it can happen that there are multiple deletion operations for the same character, if multiple users deleted the same character. And now we can take this table and encode it in a very compact form, using some fairly simple tricks: we'll
first encode each column of this table individually. For example, we can look at the first column, the counter column, which consists of the numbers one, two, three, four, five, six. How do we encode that efficiently? Well, first of all we delta-encode it, which means that for each number we calculate the difference between that number and its predecessor. For one, two, three, four, five, six, the difference between any two subsequent numbers is one, so this delta-encodes to 1, 1, 1, 1, 1, 1, which we can then run-length encode as "six times the number 1", and then we can use a variable-length binary encoding for integers to encode "6, 1" in two bytes. This variable-length encoding just ensures that small numbers get represented in one byte, slightly bigger numbers in two bytes, bigger numbers still in three bytes, and so on. So here we've now encoded this entire counter column in two bytes.
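A hedged sketch of those three steps, delta encoding, run-length encoding, and variable-length integers; this illustrates the principle, not Automerge's actual wire format (real formats also need a signed variant, such as ZigZag, for negative deltas):

```python
def delta_encode(nums):
    # difference between each number and its predecessor (first vs 0)
    prev, out = 0, []
    for n in nums:
        out.append(n - prev)
        prev = n
    return out

def run_length_encode(values):
    # collapse repeated values into (count, value) pairs
    runs = []
    for v in values:
        if runs and runs[-1][1] == v:
            runs[-1] = (runs[-1][0] + 1, v)
        else:
            runs.append((1, v))
    return runs

def uvarint(n):
    # variable-length unsigned integer: 7 bits per byte, high bit = "more"
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

counters = [1, 2, 3, 4, 5, 6]
runs = run_length_encode(delta_encode(counters))            # [(6, 1)]
encoded = b"".join(uvarint(c) + uvarint(v) for c, v in runs)  # 2 bytes
```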
Similarly, we can proceed to the actor ID column. For this column, first of all, if we make a lookup table, we don't need to keep repeating the huge IDs: we just map "a" to 0 and "b" to 1, for example, and now this column translates into 0, 0, 0, 0, 0, 0, which we can run-length encode and then encode with the variable-length binary encoding again, and again we've encoded this entire column in two bytes. Moving on, let's look at the actual text column. What I've actually done is represent the text in two separate columns: one is the UTF-8 byte string for the particular character that is being inserted, and in
the length column we have the number of bytes for that particular Unicode character. If what we're doing is just typing English text, then every single character is going to be within the one-byte range of UTF-8, so this column is just going to consist of lots of ones, since all of the characters have length 1, and that encodes super compactly again. And now that we've encoded the length column for the characters, we can just concatenate the character bytes, because we can later split out where the boundaries are between the characters just by looking at the length column. So we're simply going to concatenate all of the bytes in the UTF-8 column: "helllo" with the triple L, which is encoded in six bytes. So what we've done, just by encoding each column individually, is represent each
column extremely compactly, in just a couple of bytes. And it turns out that because of the editing patterns you tend to get with text, you tend to get these increasing sequences, because people just tend to type letters in sequential order, which means you get increasing sequences of counters that always increase by one each time. So the whole data set has a structure which compresses really nicely, and we can take advantage of our knowledge of this structure to encode it very compactly. This gives us the set of operations; now, in order to have the full change history, we need a little bit of extra metadata: for each set of changes that was made at some point in time, we need the timestamp of when that occurred and the range of operation IDs. With this extra metadata we can now reconstruct exactly what the document looked like at every past point in time, while using only a very small
amount of additional storage space. So if people tell you that with CRDTs you have to put effort into tombstone collection, because tombstones waste a lot of space, just think back to this table here: the tombstones actually cost us just 48 percent of the space, a fairly small cost to maintain, whereas choosing between a simple JSON format and an optimized binary format made a factor-of-200 difference. So I think this means we have the ability, if we want, to store the full change history and to do diffs and interesting visualizations of the change history of our documents, at very low cost, and we don't actually have to throw away all of
this extra metadata that we have. So this brings me to the end. As I said at the beginning, I think CRDTs are very easy to implement badly: they're very easy to implement in a way that is inefficient and that has all sorts of weird behaviors, like the interleaving anomaly that we saw, and there are all these problems of how we do a move operation, both on lists and on trees, in a way that is safe and reasonably intuitive to our users. But as you've seen from this talk, we have made a lot of progress on these topics in just the last few years, and I think it's going in a direction where CRDTs are really starting to become something that can be the foundation for a very large set of applications, so that we can start to build actual, real, practical user software, not just research prototypes, on top of these basic principles. So I hope you will join me in finding this very interesting and in exploring where this ends up going in the future. I will end with some references: first of all, the list of the various text-editing CRDTs that I talked about in the context of interleaving, and finally the publications that I've worked on for the various algorithms, anomalies and problems that we talked about in this talk. All of the slides are online, so you can find them there. Thank you very much for listening, I hope you enjoyed this talk, and I'll see you again soon.