Workshop on Zhang and Shasha Tree Edit Disance algorithm + alignment

By HectorHugoFrancoPeny

Summary

## Key takeaways - **Leen Stein distance via DP matrix**: The Leen Stein distance between two strings is computed by filling a dynamic programming table, initializing the first row and column with ascending integers and using the Leen Stein formula; the total distance appears at the bottom‑right cell of the matrix. [02:09] - **Forest tables solve tree edit distance**: The Zhang–Shasha algorithm evaluates tree edit distance by building forest DP tables and considering three elementary operations—deleting the source root, inserting the target root, or swapping the two roots—selecting the cheapest combination. [15:20] - **Tree mapping constraints preserve order, ancestry, one-to-one**: A valid tree mapping must preserve right‑to‑left sibling order, keep ancestor‑descendant relationships unchanged, and map each node to at most one node. [06:10] - **Key roots and most left descendants structure DP**: Key roots and most‑left descendants partition the tree into independent sub‑problems, allowing the DP tables to be filled efficiently. [08:10] - **Backtracking recovers alignment from DP tables**: After the DP tables are filled, back‑tracking the cells tells whether each step was a swap, insertion, or deletion, recovering the full alignment. [30:01] - **Zhang-Shasha efficient, matches Leen Stein on linear trees**: The Zhang–Shasha algorithm runs in polynomial time and space, and when both trees are linear (a single path) its forest table collapses to the ordinary Leen Stein matrix. [29:50]

Topics Covered

Valid Tree Mappings Preserve Ancestry and Order
Key Roots Anchor Tree Structure Comparisons
Forest Decomposition Solves Tree Edit Distance
Tree Edit Distance Generalizes String Edit Distance
Tree Distance Achieves Efficient Computation and Alignment

Full Transcript

workshop on three edit disc presented by Hector Franco you can find me at Franco at tcd.ie

at tcd.ie let's start seeing the Leen Stein algorithm and the sub string matching algorithm the string distance also know

as the Leen Stein algorithm it's a precursor of the three edit distance and it measure the shortest sequence of

command that transform a string s into a string T and what it does is to count the

amount of operations of uh swapping one character to another deleting uh one character in s or

inserting uh one character in t for instance uh these two family names uh

have a distance of two because the letter L um have to

be um inserted on the second on on T and the letter e have to be swapped into to

O So only two uh modifications Computing the L Stein distance it's made by uh calculating uh

intermediate cost of of matching uh sub strings of the first uh string to the

second by that always in include from the first to uh a certain amount of uh characters [Music]

so the way to do to do it for these two uh names is to uh create a table fill

the first row and column by ascendant uh numbers and then use uh the Leen Stein formula to

um fill the rest of the table and at the end we will find the distance at the bottom uh right corner of the Matrix

which mean the cost of matching the whole string to the whole string of the other side yeah that's the

distance now once we create the M The Matrix we can also uh back track the the numbers

and obtain in this way the mapping so here we can see that n match with n without any

changes so uh in both both cases uh and it match to

zero here we can imagine that um e and H doesn't match so what we did is to insert the letter uh

e here we have two possibilities because the C can match uh the first or the second

letter uh so we have two possible uh mappings for the same distance now I would like you to practice it it on your own let's imagine

the example between G and good you should be already able to fill the first row and column with ascendant numbers and probably you already have

the rest of the numbers and again in this case there is two possibilities because God have one o and good have two

so you can match in any place and the cost is one because in any case you have to insert an O how about sub string matching

mean how about having one a string in this case a universe finding a sub well the algorithm will be the same but the

first row on on universe will be all seros and that's that means that we can delete as many letters from the

beginning of uh of Universe for free and the answer will be not on the bottom right corner but on the bottom any of

the values uh will be good what means that we can delete uh characters from the from the back uh

of the string for free and we can uh get the alignment just in the same way now let's go to three addit

distance the three addit distance uh was proposed to solve what is the minimal type mapping and some shha propose an

algorithm to do it so a Time mapping is a mapping in which uh the right to left order is

preserved so siblings cannot uh swap order or the ancestry uh relation it's preserved so if one node is an ancestor

of another that relation should be kept after the match and also so one node only can

match a s another node or n but you cannot match one node into another two nodes now let's pass to the basic uh

Concepts before starting explaining the the algorithm let's start by what is a post uh postor traversal postor

the way to numerate the notes of a three inra verial postorder is by first leveling the left side then the right side and then the node

itself please try to practice yourself we have to label this three so probably you already guess it

right right side then repeat the algorithm again and so on and these are all the ancestors for a

given uh node which is very intuitive you also need the concept of the most left descent

so for each of these groups the most left descendence is the one at the at the bottom

it doesn't have to be graphically uh on the left but means the most the m means the the

fast descendant as much as descendant you can find now you also need the concept of the key uh roots and is for the same

groups that we made before um is the the higher ancestor so somehow for each group of most left descendants there is also

there is only one most left descendant and one key root and yeah one key root could be also the

most left descendant now let's go to Maps the concept of a mapping is very easy to understand there are notes that are deleted there are notes that are

inserted and there are notes that are perfectly swapped without changing the content and others are match but swapping the

cont please note that this is not a Time M so it's just an illustration of a

map and what will be the transformation so with this definition we have four sets um of

uh of swaps so they deleted they inserted the noes that are swapped that could be perfectly swapped or could be swapped with a with a

change a way to define mappings in a more formal way that the mapping is in this case it only refers to the

swaps so s is the source T is the Target and the map are pairs of note between the source and the target let's see the time mappings again

one: one sibling order preserved on store order preser so let's try to

practice let's guess if the first mapping this one here it's a tie mapping or not

ready and the answer is no this mapping is not preserving the ancestry uh relation because

sorry because this note here was not an ancestor of this one but after the transformation uh it become an ancestor

and this is not allow can you guess for the second map probably you already C it the answer

is not again because you cannot have multiple uh one node match to multiple noes can you guess the third

one well the third one it's a possible time mapping it it might be not the last cross mapping but

indeed is a possible time up and can you guess the fourth one again is not a Time mapping because

siblings is few siblings are swapping the order so the left to right restriction is not

preserved and now let's go to to the three distance algor and it could be multiple time mappings between two threes

but the three distance is the distance of the less expensive possible time

up and the cost is the is defined as the number of insertion deletions or uh swaps which again uh it can vary the

cost from one implementation to another now I want you to try to prove the triangle inequality just as an

exercise knowing that insertions deletions and swap have the same uh cost well I just want to mention that all

possible combinations of uh sets lead to the same conclusion that the triangle inequality have to be preced and now let's jump to the most

interesting point and probably that's why you are watching this video the s Shasha algorithm and how to produce the alignment well the first thing to know

in this algorithm is that there is a three table and there is a forest table sorry a set of forest

tables so the forest distance is well you know it's the distance between

a set of Threes to another set of Threes And we are going to designate it by this Lambda

so three distance is the minimal distance between deleting the first node and swapping the

forest to the second Tre uh inserting the first uh node

of the second tree and later matching adding this cones to the Tre to the forest of what we have left or just to

swap the cost of swapping uh both uh root nodes and then uh matching both uh trees both

Forest but the forest definition have it's a slightly different I mean it's exactly the same for inserting or deleting but in order to preserve uh the

right to the ancestry relation then um they have not this only swap

if the rest of the of that particular tree for which they are key Roots is also

preserved so this case we chop off the the the last three I think I already mentioned this that this is not allow to check the uh

to make sure that the ancestry relation it's a Reser and as you can see this is a possible

uh Mapp I mean a possible mapping that is not a Time mapping and this is what could uh happen and this is the SEO code for the

three distance algorithm so for HQ root

uh of the first threee and the second for each pair of uh key roots in traversal post

order um well we calculate the three edit distance and to calculate the three edit distance um we

create a a forest table um and to know how big should be we

create these bones and then we create the three uh the forest

table and again we initialize the rows and columns with uh the value of with an

ascendant value which in this case should be 0 1 2 3 until whatever number or whatever is the cost of

deleting um one element after another which we assume previously that is always one and then when we already initialize

the forest table we go for each uh left most left position of each uh

node and are equal then well we calculate the minimal

uh Forest distance between the three possibilities uh we already um inspect

before and if not it means that we just use or

uh the the uh the forest or the tree distance and again this is uh to calculate that if it's a three

then well U it all in case of sorry let me go

back in in this case we have to calculate to get the three distance that should be PR

calculated and another another forx distance which is this this case here let's try to see with a sample as

more graphical as possible in in this case let's imagine that the colors means uh different uh different

labeles so we have three different colors and we have uh two3 the the border of the notes is different just to

differentiate both this and now let's imagine that the most left array is

something like sry for the first three is one one

then the Third is itself the fourth is itself for the fifth note

is um one as well and for the six node the most left is one as well and this

the key Ro the key Roots well will be uh this not is not a key root or one

meaning this node is a key roote so in this this case the key roots are 3 4 and six again the most left

notes and the key roots and now this is the Table of atomic cost which is basically checking

the label and if the labels are the same then the cost is zero but if the label are different then the cost is

one and now let's imagine how to fill uh the table so the First Column is

useless because we start numerating by one then let's suppose that we initialize the values of the

Matrix with n so not not a number just to make easy to know which are our real values and what are not and in this case

The Matrix is 6 by six because that are the amount of notes of both three and

now we try to calculate the three distance between uh the sub three uh three with the root node three and four

because they are uh the small the first ones on uh the list and because the trees only have one

note it seems very simple so we can delete them or to swap

and we swap so we add in the three distance

table a z because the cost of these two sub threes swapping them is z

now let's see we repeat calculating uh for this a three containing three and

containing four and three and well the the way was to uh both are three so

we will fill the three table and using our forest table uh we know

that swapping three and and three cost zero so why not another extra cost of

one for swap for inserting four which happened to be well the the winner

and now we pick the roots three and five and repeat and there are just two notes which is very easy again and we get one

and again we update the table and now we pick the roots

three the the root node three of the first three and the root node six of the second three which is a lot more complex because it includes six nodes under

itself as the S so we are trying to swap one three of one note to a three with six

no so now we try to swap three and one which is the most

left note of the sub three with the root uh six and is a tree so we have to update the

table and the calculation is very easy again swapping is the best choice and now for the second three we we compare

to match the no three with the three containing one two and one and again it's very easy to operate once we know the previous result and

using our forest three table we got that the cost is one because we just insert the node two

and in this case it's important to know that we cannot look in diagonal because this now what we are trying to compare

is not a tree into a tree per case it have to come from the three uh edit distance uh to to the from the three

table and again in this case seems the best option and again this is not a tree but we can get the

value so we are trying to match the node three with the for with a fors with two no and here I just want to leave it and

show the complete example the alignment that would have produced again these are the the key roots are marked in

red and the green are the most left noes now if you imagine that got and good are trees by just creating a vertical tree

should be easy to calculate to use the S shash algorithm and get exactly the same results as with Leen Stein

algorith and in this case we get that the forest table it's exactly the same as the three table and the three table is exactly the same as the Leen Stein

table the interesting point is that the three the distance is uh it's very effici efficient in time and in

space and again like the leein it's very easy to calculate the alignment by just back tracking uh where the numbers come

from if it was a swap or or an alignment a delis or an insertion thanks for your attention and

I hope this Workshop was useful to understand the sang and Shasha algorithm on the

Loading...

Loading video analysis...