Workshop on Zhang and Shasha Tree Edit Disance algorithm + alignment
By HectorHugoFrancoPeny
Summary
## Key takeaways - **Leen Stein distance via DP matrix**: The Leen Stein distance between two strings is computed by filling a dynamic programming table, initializing the first row and column with ascending integers and using the Leen Stein formula; the total distance appears at the bottom‑right cell of the matrix. [02:09] - **Forest tables solve tree edit distance**: The Zhang–Shasha algorithm evaluates tree edit distance by building forest DP tables and considering three elementary operations—deleting the source root, inserting the target root, or swapping the two roots—selecting the cheapest combination. [15:20] - **Tree mapping constraints preserve order, ancestry, one-to-one**: A valid tree mapping must preserve right‑to‑left sibling order, keep ancestor‑descendant relationships unchanged, and map each node to at most one node. [06:10] - **Key roots and most left descendants structure DP**: Key roots and most‑left descendants partition the tree into independent sub‑problems, allowing the DP tables to be filled efficiently. [08:10] - **Backtracking recovers alignment from DP tables**: After the DP tables are filled, back‑tracking the cells tells whether each step was a swap, insertion, or deletion, recovering the full alignment. [30:01] - **Zhang-Shasha efficient, matches Leen Stein on linear trees**: The Zhang–Shasha algorithm runs in polynomial time and space, and when both trees are linear (a single path) its forest table collapses to the ordinary Leen Stein matrix. [29:50]
Topics Covered
- Valid Tree Mappings Preserve Ancestry and Order
- Key Roots Anchor Tree Structure Comparisons
- Forest Decomposition Solves Tree Edit Distance
- Tree Edit Distance Generalizes String Edit Distance
- Tree Distance Achieves Efficient Computation and Alignment
Full Transcript
workshop on three edit disc presented by Hector Franco you can find me at Franco at tcd.ie
at tcd.ie let's start seeing the Leen Stein algorithm and the sub string matching algorithm the string distance also know
as the Leen Stein algorithm it's a precursor of the three edit distance and it measure the shortest sequence of
command that transform a string s into a string T and what it does is to count the
amount of operations of uh swapping one character to another deleting uh one character in s or
inserting uh one character in t for instance uh these two family names uh
have a distance of two because the letter L um have to
be um inserted on the second on on T and the letter e have to be swapped into to
O So only two uh modifications Computing the L Stein distance it's made by uh calculating uh
intermediate cost of of matching uh sub strings of the first uh string to the
second by that always in include from the first to uh a certain amount of uh characters [Music]
so the way to do to do it for these two uh names is to uh create a table fill
the first row and column by ascendant uh numbers and then use uh the Leen Stein formula to
um fill the rest of the table and at the end we will find the distance at the bottom uh right corner of the Matrix
which mean the cost of matching the whole string to the whole string of the other side yeah that's the
distance now once we create the M The Matrix we can also uh back track the the numbers
and obtain in this way the mapping so here we can see that n match with n without any
changes so uh in both both cases uh and it match to
zero here we can imagine that um e and H doesn't match so what we did is to insert the letter uh
e here we have two possibilities because the C can match uh the first or the second
letter uh so we have two possible uh mappings for the same distance now I would like you to practice it it on your own let's imagine
the example between G and good you should be already able to fill the first row and column with ascendant numbers and probably you already have
the rest of the numbers and again in this case there is two possibilities because God have one o and good have two
so you can match in any place and the cost is one because in any case you have to insert an O how about sub string matching
mean how about having one a string in this case a universe finding a sub well the algorithm will be the same but the
first row on on universe will be all seros and that's that means that we can delete as many letters from the
beginning of uh of Universe for free and the answer will be not on the bottom right corner but on the bottom any of
the values uh will be good what means that we can delete uh characters from the from the back uh
of the string for free and we can uh get the alignment just in the same way now let's go to three addit
distance the three addit distance uh was proposed to solve what is the minimal type mapping and some shha propose an
algorithm to do it so a Time mapping is a mapping in which uh the right to left order is
preserved so siblings cannot uh swap order or the ancestry uh relation it's preserved so if one node is an ancestor
of another that relation should be kept after the match and also so one node only can
match a s another node or n but you cannot match one node into another two nodes now let's pass to the basic uh
Concepts before starting explaining the the algorithm let's start by what is a post uh postor traversal postor
the way to numerate the notes of a three inra verial postorder is by first leveling the left side then the right side and then the node
itself please try to practice yourself we have to label this three so probably you already guess it
right right side then repeat the algorithm again and so on and these are all the ancestors for a
given uh node which is very intuitive you also need the concept of the most left descent
so for each of these groups the most left descendence is the one at the at the bottom
it doesn't have to be graphically uh on the left but means the most the m means the the
fast descendant as much as descendant you can find now you also need the concept of the key uh roots and is for the same
groups that we made before um is the the higher ancestor so somehow for each group of most left descendants there is also
there is only one most left descendant and one key root and yeah one key root could be also the
most left descendant now let's go to Maps the concept of a mapping is very easy to understand there are notes that are deleted there are notes that are
inserted and there are notes that are perfectly swapped without changing the content and others are match but swapping the
cont please note that this is not a Time M so it's just an illustration of a
map and what will be the transformation so with this definition we have four sets um of
uh of swaps so they deleted they inserted the noes that are swapped that could be perfectly swapped or could be swapped with a with a
change a way to define mappings in a more formal way that the mapping is in this case it only refers to the
swaps so s is the source T is the Target and the map are pairs of note between the source and the target let's see the time mappings again
one: one sibling order preserved on store order preser so let's try to
practice let's guess if the first mapping this one here it's a tie mapping or not
ready and the answer is no this mapping is not preserving the ancestry uh relation because
sorry because this note here was not an ancestor of this one but after the transformation uh it become an ancestor
and this is not allow can you guess for the second map probably you already C it the answer
is not again because you cannot have multiple uh one node match to multiple noes can you guess the third
one well the third one it's a possible time mapping it it might be not the last cross mapping but
indeed is a possible time up and can you guess the fourth one again is not a Time mapping because
siblings is few siblings are swapping the order so the left to right restriction is not
preserved and now let's go to to the three distance algor and it could be multiple time mappings between two threes
but the three distance is the distance of the less expensive possible time
up and the cost is the is defined as the number of insertion deletions or uh swaps which again uh it can vary the
cost from one implementation to another now I want you to try to prove the triangle inequality just as an
exercise knowing that insertions deletions and swap have the same uh cost well I just want to mention that all
possible combinations of uh sets lead to the same conclusion that the triangle inequality have to be preced and now let's jump to the most
interesting point and probably that's why you are watching this video the s Shasha algorithm and how to produce the alignment well the first thing to know
in this algorithm is that there is a three table and there is a forest table sorry a set of forest
tables so the forest distance is well you know it's the distance between
a set of Threes to another set of Threes And we are going to designate it by this Lambda
so three distance is the minimal distance between deleting the first node and swapping the
forest to the second Tre uh inserting the first uh node
of the second tree and later matching adding this cones to the Tre to the forest of what we have left or just to
swap the cost of swapping uh both uh root nodes and then uh matching both uh trees both
Forest but the forest definition have it's a slightly different I mean it's exactly the same for inserting or deleting but in order to preserve uh the
right to the ancestry relation then um they have not this only swap
if the rest of the of that particular tree for which they are key Roots is also
preserved so this case we chop off the the the last three I think I already mentioned this that this is not allow to check the uh
to make sure that the ancestry relation it's a Reser and as you can see this is a possible
uh Mapp I mean a possible mapping that is not a Time mapping and this is what could uh happen and this is the SEO code for the
three distance algorithm so for HQ root
uh of the first threee and the second for each pair of uh key roots in traversal post
order um well we calculate the three edit distance and to calculate the three edit distance um we
create a a forest table um and to know how big should be we
create these bones and then we create the three uh the forest
table and again we initialize the rows and columns with uh the value of with an
ascendant value which in this case should be 0 1 2 3 until whatever number or whatever is the cost of
deleting um one element after another which we assume previously that is always one and then when we already initialize
the forest table we go for each uh left most left position of each uh
node and are equal then well we calculate the minimal
uh Forest distance between the three possibilities uh we already um inspect
before and if not it means that we just use or
uh the the uh the forest or the tree distance and again this is uh to calculate that if it's a three
then well U it all in case of sorry let me go
back in in this case we have to calculate to get the three distance that should be PR
calculated and another another forx distance which is this this case here let's try to see with a sample as
more graphical as possible in in this case let's imagine that the colors means uh different uh different
labeles so we have three different colors and we have uh two3 the the border of the notes is different just to
differentiate both this and now let's imagine that the most left array is
something like sry for the first three is one one
then the Third is itself the fourth is itself for the fifth note
is um one as well and for the six node the most left is one as well and this
the key Ro the key Roots well will be uh this not is not a key root or one
meaning this node is a key roote so in this this case the key roots are 3 4 and six again the most left
notes and the key roots and now this is the Table of atomic cost which is basically checking
the label and if the labels are the same then the cost is zero but if the label are different then the cost is
one and now let's imagine how to fill uh the table so the First Column is
useless because we start numerating by one then let's suppose that we initialize the values of the
Matrix with n so not not a number just to make easy to know which are our real values and what are not and in this case
The Matrix is 6 by six because that are the amount of notes of both three and
now we try to calculate the three distance between uh the sub three uh three with the root node three and four
because they are uh the small the first ones on uh the list and because the trees only have one
note it seems very simple so we can delete them or to swap
and we swap so we add in the three distance
table a z because the cost of these two sub threes swapping them is z
now let's see we repeat calculating uh for this a three containing three and
containing four and three and well the the way was to uh both are three so
we will fill the three table and using our forest table uh we know
that swapping three and and three cost zero so why not another extra cost of
one for swap for inserting four which happened to be well the the winner
and now we pick the roots three and five and repeat and there are just two notes which is very easy again and we get one
and again we update the table and now we pick the roots
three the the root node three of the first three and the root node six of the second three which is a lot more complex because it includes six nodes under
itself as the S so we are trying to swap one three of one note to a three with six
no so now we try to swap three and one which is the most
left note of the sub three with the root uh six and is a tree so we have to update the
table and the calculation is very easy again swapping is the best choice and now for the second three we we compare
to match the no three with the three containing one two and one and again it's very easy to operate once we know the previous result and
using our forest three table we got that the cost is one because we just insert the node two
and in this case it's important to know that we cannot look in diagonal because this now what we are trying to compare
is not a tree into a tree per case it have to come from the three uh edit distance uh to to the from the three
table and again in this case seems the best option and again this is not a tree but we can get the
value so we are trying to match the node three with the for with a fors with two no and here I just want to leave it and
show the complete example the alignment that would have produced again these are the the key roots are marked in
red and the green are the most left noes now if you imagine that got and good are trees by just creating a vertical tree
should be easy to calculate to use the S shash algorithm and get exactly the same results as with Leen Stein
algorith and in this case we get that the forest table it's exactly the same as the three table and the three table is exactly the same as the Leen Stein
table the interesting point is that the three the distance is uh it's very effici efficient in time and in
space and again like the leein it's very easy to calculate the alignment by just back tracking uh where the numbers come
from if it was a swap or or an alignment a delis or an insertion thanks for your attention and
I hope this Workshop was useful to understand the sang and Shasha algorithm on the
Loading video analysis...