<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-2145">
  <Title>Error-tolerant Tree Matching</Title>
  <Section position="4" start_page="0" end_page="862" type="metho">
    <SectionTitle>
2 Approximate Tree Matching
</SectionTitle>
    <Paragraph position="0"> In this paper we consider the problem of searching a database of trees for all trees that are &quot;close&quot; to a given query tree, where closeness is defined in terms of an error metric. The trees that we consider have labeled terminal and non-terminal nodes. We assume that all immediate children of a given node have unique labels, and that a total ordering on these labels is defined. We consider two trees close if we can * add/delete a small number of leaves to/from one of the trees, and/or * change the label of a small number of leaves in one of the trees to get the second tree. A pair of such &quot;close&quot; trees is depicted in Figure 1.</Paragraph>
    <Section position="1" start_page="0" end_page="860" type="sub_section">
      <SectionTitle>
2.1 Linearization of trees
</SectionTitle>
      <Paragraph position="0"> Before proceeding any further, we define the terminology used in the following sections. We identify each leaf node in a tree with an ordered vertex list $(v_0, v_1, v_2, \ldots, v_d)$, where each $v_i$ is the label of a vertex on the path from the root $v_0$ to the leaf $v_d$ at depth $d$, and for $i \geq 0$, $v_i$ is the parent of $v_{i+1}$. A tree with $n$ leaves is represented by a vertex list sequence $VLS = V_1, V_2, \ldots, V_n$, where each $V_j = v^j_0, v^j_1, v^j_2, \ldots, v^j_{d_j}$ corresponds to the vertex list for a leaf at level $d_j$. This sequence is constructed by taking into account the total order on the labels at every level; that is, $V_i$ is lexicographically less than $V_{i+1}$, based on the total ordering of the vertex labels. For instance, the first tree in Figure 1 would be represented by the vertex list sequence:</Paragraph>
      <Paragraph position="2"> assuming the normal lexicographic ordering on node names.</Paragraph>
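      <Paragraph>
As an illustration of the linearization described above, the following Python sketch collects one root-to-leaf label path per leaf and sorts the paths lexicographically. The nested-tuple tree encoding and the names linearize and tree_a are assumptions made for this example, not notation from the paper; the sample tree is built so that it corresponds to the sequence quoted for tree (a) in Section 2.3.

```python
from typing import List, Tuple

# Hypothetical tree encoding (not from the paper): (label, [child, child, ...]).
Tree = Tuple[str, list]
VertexList = Tuple[str, ...]


def linearize(tree: Tree) -> List[VertexList]:
    """Return the vertex list sequence of a tree: one root-to-leaf label
    path per leaf, ordered lexicographically."""
    paths: List[VertexList] = []

    def walk(node: Tree, prefix: VertexList) -> None:
        label, children = node
        path = prefix + (label,)
        if not children:
            # A leaf: the labels from the root down to it form one vertex list.
            paths.append(path)
        for child in children:
            walk(child, path)

    walk(tree, ())
    # Python compares tuples lexicographically, which plays the role of the
    # total ordering on labels assumed in the text.
    return sorted(paths)


# Root a with children e and b; b has children k, c and a; that a has child x.
tree_a = ("a", [("e", []), ("b", [("k", []), ("c", []), ("a", [("x", [])])])])
vls_a = linearize(tree_a)
print(vls_a)
```

Children are deliberately listed out of order in tree_a to show that the ordering comes from the final sort rather than from the input layout.
      </Paragraph>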
    </Section>
    <Section position="2" start_page="860" end_page="861" type="sub_section">
      <SectionTitle>
2.2 Distance between two trees
</SectionTitle>
      <Paragraph position="0"> We define the distance between two trees according to the structural differences or differences in leaf labels. We consider an extra or a missing leaf as a structural change. If, however, both trees have leaves whose vertex lists match in all but the last (leaf vertex) label, we consider this a difference in leaf labels. For instance, in Figure 2, there is an extra leaf in tree (b) in comparison to the tree in (a), while tree (c) has a leaf label difference. We associate the following costs with these differences: * If both trees have a leaf whose vertex list matches in all but the last (leaf vertex) label, we assign a label difference error of C.</Paragraph>
      <Paragraph position="1"> * If a certain leaf is missing in one of the trees but exists in the other one, then we assign a cost S for this structural difference.</Paragraph>
      <Paragraph position="2"> We currently treat all structural or leaf label differences as incurring a cost that is independent of the tree level at which the difference takes place.</Paragraph>
      <Paragraph position="3"> If, however, differences that are closer to the root of the tree are considered to be more serious than differences further away from the root, it is possible to modify the formulation to take this into account.</Paragraph>
      <Paragraph position="4"> 2.3 Converting a set of trees into a trie</Paragraph>
      <Paragraph position="5"> A tree database D consists of a set of trees $T_1, T_2, \ldots, T_k$, each $T_i$ being a vertex list sequence for a tree. Once we convert all the trees to a linear form, we have a set of vertex list sequences. We can convert this set into a trie data structure. This trie compresses any redundancies in the prefixes of the vertex list sequences, achieving a certain compaction which helps during searching. [1]</Paragraph>
      <Paragraph position="6"> For instance, the three trees in Figure 2 can be represented as a trie as shown in Figure 3.</Paragraph>
      <Paragraph position="7"> The edge labels along the path to a leaf, when concatenated in order, give the vertex list sequence for a tree; e.g., ((a,b,a,x), (a,b,c), (a,b,k), (a,e)) represents the tree (a) in Figure 2.</Paragraph>
      <Paragraph position="8"> [1] Note that it is possible to obtain more space reduction by also sharing any common postfixes of the vertex list sequences, using a directed acyclic graph representation instead of a trie, but this does not improve the execution time.</Paragraph>
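      <Paragraph>
The prefix sharing described in this subsection can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class TrieNode and the helpers build_trie and count_nodes are hypothetical names, and since Figure 2 is not reproduced here, tree_b (one extra leaf) and tree_c (one changed leaf label) are invented stand-ins for trees (b) and (c); only tree_a is taken from the sequence quoted in the text.

```python
from typing import Dict, List, Tuple

VertexList = Tuple[str, ...]


class TrieNode:
    """One trie node; each outgoing edge is labeled with a whole vertex list."""

    def __init__(self) -> None:
        self.edges: Dict[VertexList, "TrieNode"] = {}
        self.terminal = False  # True if some database tree ends at this node


def build_trie(database: List[List[VertexList]]) -> TrieNode:
    root = TrieNode()
    for sequence in database:
        node = root
        for vertex_list in sequence:
            # setdefault follows an existing edge if there is one, so a
            # prefix shared by several trees is stored only once.
            node = node.edges.setdefault(vertex_list, TrieNode())
        node.terminal = True
    return root


def count_nodes(node: TrieNode) -> int:
    return 1 + sum(count_nodes(child) for child in node.edges.values())


tree_a = [("a", "b", "a", "x"), ("a", "b", "c"), ("a", "b", "k"), ("a", "e")]
# Illustrative stand-ins, not the actual trees of Figure 2:
tree_b = [("a", "b", "a", "x"), ("a", "b", "c"), ("a", "b", "k"),
          ("a", "b", "m"), ("a", "e")]
tree_c = [("a", "b", "a", "x"), ("a", "b", "c"), ("a", "b", "k"), ("a", "f")]

trie = build_trie([tree_a, tree_b, tree_c])
print(count_nodes(trie))
```

With these three sequences the first three vertex lists are shared, so the trie has 8 nodes, whereas three unshared chains hanging off one root would need 14.
      </Paragraph>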
    </Section>
    <Section position="3" start_page="861" end_page="861" type="sub_section">
      <SectionTitle>
2.4 Error-tolerant matching in the trie
</SectionTitle>
      <Paragraph position="0"> Our concern in this work is not the exact match of trees but rather approximate match. Given the vertex list sequence for a query tree, exact match over the trie can be performed using the standard techniques, by following the edge labeled with the next vertex list until a leaf in the trie is reached and the query vertex list sequence is exhausted.</Paragraph>
      <Paragraph position="1"> For approximate tree matching, we use the error-tolerant approximate finite-state recognition algorithm (Oflazer, 1996), which finds all strings within a given error threshold of some string in the regular set accepted by the underlying finite-state acceptor. An adaptation of this algorithm will be briefly summarized here.</Paragraph>
      <Paragraph position="2"> Error-tolerant matching of vertex list sequences requires an error metric for measuring how much two such sequences deviate from each other. The distance between two sequences measures the minimum number of insertions, deletions and leaf label changes that are necessary to convert one tree into another. It should be noted that this is different from the error metric defined by (Wang et al., 1994).</Paragraph>
      <Paragraph position="3"> Let $Z = Z_1, Z_2, \ldots, Z_p$ denote a generic vertex list sequence of $p$ vertex lists. $Z[j]$ denotes the initial subsequence of $Z$ up to and including the $j$th leaf label. We will use $X$ (of length $m$) to denote the query vertex list sequence, and $Y$ (of length $n$) to denote the sequence that is a (possibly partial) candidate vertex list sequence (from the database of trees).</Paragraph>
      <Paragraph position="4"> Given two vertex list sequences X and Y, the distance $\mathrm{dist}(X[m], Y[n])$, computed according to the recurrence below, gives the minimum number of leaf insertions, deletions or leaf label changes necessary to change one tree to the other. Here $x_m$ and $y_n$ denote the last vertex lists of $X[m]$ and $Y[n]$.</Paragraph>
      <Paragraph position="5"> $$\mathrm{dist}(X[m], Y[n]) = \mathrm{dist}(X[m-1], Y[n-1]) \quad \text{if } x_m = y_n \text{ (last vertex lists are the same)}$$
$$\mathrm{dist}(X[m], Y[n]) = \mathrm{dist}(X[m-1], Y[n-1]) + C \quad \text{if } x_m \text{ and } y_n \text{ differ only at the leaf label}$$</Paragraph>
      <Paragraph position="7"> For a tree database D and a distance threshold $t &gt; 0$, we consider a query tree represented by a vertex list sequence $X[m]$ (not in the database) to match the database with an error of t if the set $C = \{Y[n] \mid Y[n] \in D \text{ and } \mathrm{dist}(X[m], Y[n]) \leq t\}$ is not empty.</Paragraph>
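      <Paragraph>
The distance can be sketched as a small dynamic program in Python. Only the two cases for equal vertex lists and for vertex lists differing in the leaf label survive in the text above, so several parts of this sketch are assumptions: the insertion and deletion cases (each costing S), the boundary rows and columns (i times S and j times S), and the choice to take the minimum over all applicable cases. The function names are hypothetical, the defaults C = 1 and S = 2 are the values reported in Section 3, and tree_b and tree_c are invented examples.

```python
from typing import List, Tuple

VertexList = Tuple[str, ...]


def differ_only_in_leaf(x: VertexList, y: VertexList) -> bool:
    """True if x and y match in all but the last (leaf vertex) label."""
    return len(x) == len(y) and x[:-1] == y[:-1] and x[-1] != y[-1]


def tree_distance(X: List[VertexList], Y: List[VertexList],
                  C: int = 1, S: int = 2) -> int:
    m, n = len(X), len(Y)
    # H[i][j] holds dist(X[i], Y[j]) for the first i and j vertex lists.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        H[i][0] = i * S          # assumed boundary: i leaves are extra in X
    for j in range(1, n + 1):
        H[0][j] = j * S          # assumed boundary: j leaves are extra in Y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            x, y = X[i - 1], Y[j - 1]
            # Assumed structural cases: one leaf is extra on either side.
            options = [H[i - 1][j] + S, H[i][j - 1] + S]
            if x == y:
                options.append(H[i - 1][j - 1])        # same vertex list
            elif differ_only_in_leaf(x, y):
                options.append(H[i - 1][j - 1] + C)    # leaf label change
            H[i][j] = min(options)
    return H[m][n]


tree_a = [("a", "b", "a", "x"), ("a", "b", "c"), ("a", "b", "k"), ("a", "e")]
tree_b = tree_a[:3] + [("a", "b", "m"), ("a", "e")]   # one extra leaf
tree_c = tree_a[:3] + [("a", "f")]                    # one changed leaf label
print(tree_distance(tree_a, tree_b), tree_distance(tree_a, tree_c))
```

Under these assumptions the extra leaf costs S = 2 and the changed label costs C = 1. Taking the minimum matters here because sibling leaves such as (a,b,k) and (a,b,m) also differ only in their leaf label, and pairing them up would give a more expensive alignment than simply treating (a,b,m) as an extra leaf.
      </Paragraph>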
    </Section>
    <Section position="4" start_page="861" end_page="862" type="sub_section">
      <SectionTitle>
2.5 An algorithm for approximate tree matching
</SectionTitle>
      <Paragraph position="0"> Standard searching with a trie corresponds to traversing a path starting from the start node (of the trie) to one of the leaf nodes (of the trie), so that the concatenation of the labels on the arcs along this path matches the input vertex list sequence. For error-tolerant matching, one needs to find all paths from the start node to one of the final nodes, so that when the labels on the edges along a path are concatenated, the resulting vertex list sequence is within a given distance threshold t of the query vertex list sequence.</Paragraph>
      <Paragraph position="1"> This search has to be very fast if approximate matching is to be of any practical use. This means that paths in the trie that can lead to no solutions have to be pruned so that the search can be limited to a very small percentage of the search space. We need to make sure that any candidate (prefix) vertex list sequence that is generated as the search is being performed does not deviate from certain initial subsequences of the query sequence by more than the allowed threshold. To detect such cases, we use the notion of a cut-off distance.</Paragraph>
      <Paragraph position="2"> The cut-off distance measures the minimum distance between an initial subsequence of the query sequence and the (possibly partial) candidate sequence. Let Y be a partial candidate sequence whose length is n, and let X be the query sequence of length m. Let $l = \max(1, n - \lfloor t/M \rfloor)$ and $u = \min(m, n + \lfloor t/M \rfloor)$, where M is the cost of insertions and deletions. The cut-off distance $\mathrm{cutdist}(X[m], Y[n])$ is defined as $$\mathrm{cutdist}(X[m], Y[n]) = \min_{l \leq i \leq u} \mathrm{dist}(X[i], Y[n]).$$ Note that, except at the boundaries, the initial subsequences of the query sequence X considered are of length $n - \lfloor t/M \rfloor$ to length $n + \lfloor t/M \rfloor$. Any initial subsequence of X shorter than l needs more than $\lfloor t/M \rfloor$ leaf node insertions, and any initial subsequence of X longer than u requires more than $\lfloor t/M \rfloor$ leaf node deletions, to at least equal Y in length, violating the distance constraint.</Paragraph>
      <Paragraph position="3"> Given a vertex list sequence X (corresponding to a query tree), a partial candidate sequence Y is generated by successively concatenating labels along the arcs as transitions are made, starting with the start state. Whenever we extend Y going along the trie, we check if the cut-off distance of X and the partial Y is within the bound specified by the threshold t. If the cut-off distance goes beyond the threshold, the last edge is backed off to the source node (in parallel with the shortening of Y) and some other edge is tried. Backtracking is recursively applied when the search cannot be continued from that node. If, during the construction of Y, a terminal node (which may or may not be a leaf of the trie) is reached without violating the cut-off distance constraint, and $\mathrm{dist}(X[m], Y[n]) \leq t$ at that point, then Y is a tree in the database that matches the input query sequence. [2] Denoting the nodes of the trie by subscripted q's ($q_0$ being the initial node, e.g., the top node in Figure 3) and the labels of the edges by V, and denoting by $\delta(q_i, V)$ the node in the trie that one can reach from node $q_i$ with edge label V (denoting a vertex list), we present, in Figure 4, the algorithm for generating all Y's by a (slightly modified) depth-first probing of the trie. The crucial point in this algorithm is that the cut-off distance computation can be performed very efficiently by maintaining a matrix H, which is an m by n matrix with element $H(i, j) = \mathrm{dist}(X[i], Y[j])$ (Du and Chang, 1992). We can note that the computation of the element $H(i+1, j+1)$ recursively depends on only $H(i, j)$, $H(i, j+1)$ and $H(i+1, j)$, from the earlier definition of the edit distance (see Figure 5). During the depth-first search of the trie, entries in column n of the matrix H have to be (re)computed only when the candidate string is of length n. [2] Note that we have to do this check since we may come to other irrelevant terminal nodes during the search.</Paragraph>
      <Paragraph position="4"> /* push empty candidate and start node to start the search */
push((ε, q_0))
while stack not empty
begin
  pop((Y', q_i)) /* pop partial sequence Y' and the node */
  for all q_j and V such that δ(q_i, V) = q_j
  begin /* extend the candidate sequence */
    Y = concat(Y', V)
    /* n is the current length of Y */
    /* check if Y has deviated too much; if not, push */
    if cutdist(X[m], Y[n]) &lt;= t then push((Y, q_j))
    /* also see if we are at a final state */
    if dist(X[m], Y[n]) &lt;= t and q_j is a terminal node then output Y
  end
end</Paragraph>
      <Paragraph position="6"> During backtracking, the entries for the last column are discarded, but the entries in prior columns are still valid. Thus all entries required by $H(i+1, j+1)$, except $H(i, j+1)$, are already available in the matrix in columns $i - 1$ and $i$. The computation of $\mathrm{cutdist}(X[m], Y[n])$ involves a loop in which the minimum is computed. This loop (indexing along column $j+1$) computes $H(i, j+1)$ before it is needed for the computation of $H(i+1, j+1)$.</Paragraph>
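      <Paragraph>
The depth-first probing with cut-off pruning can be sketched in Python as below. This is a hedged illustration rather than the implementation used in the paper: the nested-dict trie with a None key marking terminal nodes, the function names, and the example trees are all assumptions, as are the insertion and deletion cases of the distance (each costing S). Each stack entry carries the column of H for its candidate, so popping an entry discards the last column while earlier columns stay valid. One deliberate deviation: the pruning window here starts at the empty prefix (index 0) instead of 1, an assumption made so that a candidate beginning with an extra leaf is not pruned too early.

```python
from typing import Dict, List, Tuple

VertexList = Tuple[str, ...]


def build_trie(database: List[List[VertexList]]) -> Dict:
    """Nested-dict trie; the key None marks a terminal node (hypothetical encoding)."""
    root: Dict = {}
    for sequence in database:
        node = root
        for vertex_list in sequence:
            node = node.setdefault(vertex_list, {})
        node[None] = True
    return root


def next_column(X, prev, y, C, S):
    """Column j+1 of H from column j, where H(i, j) = dist(X[i], Y[j])."""
    col = [prev[0] + S]
    for i in range(1, len(X) + 1):
        x = X[i - 1]
        options = [col[i - 1] + S, prev[i] + S]      # assumed structural cases
        if x == y:
            options.append(prev[i - 1])              # same vertex list
        elif len(x) == len(y) and x[:-1] == y[:-1]:
            options.append(prev[i - 1] + C)          # leaf label change
        col.append(min(options))
    return col


def search(trie: Dict, X: List[VertexList], t: int, C: int = 1, S: int = 2):
    m, k = len(X), t // S
    matches = []
    first = [i * S for i in range(m + 1)]            # column for the empty candidate
    stack = [([], trie, first)]
    while stack:
        Y, node, col = stack.pop()
        for V, child in node.items():
            if V is None:
                continue                             # terminal marker, not an edge
            Y2 = Y + [V]                             # extend the candidate sequence
            col2 = next_column(X, col, V, C, S)
            n = len(Y2)
            window = col2[max(0, n - k):min(m, n + k) + 1]
            if window and t >= min(window):          # cut-off distance within t
                stack.append((Y2, child, col2))
            if None in child and t >= col2[m]:       # terminal node within t
                matches.append((Y2, col2[m]))
    return matches


tree_a = [("a", "b", "a", "x"), ("a", "b", "c"), ("a", "b", "k"), ("a", "e")]
tree_b = tree_a[:3] + [("a", "b", "m"), ("a", "e")]  # invented: one extra leaf
tree_c = tree_a[:3] + [("a", "f")]                   # invented: changed leaf label
trie = build_trie([tree_a, tree_b, tree_c])
query = tree_a[:3] + [("a", "g")]
print([d for _, d in search(trie, query, 2)], [d for _, d in search(trie, query, 4)])
```

With threshold 2 the query reaches tree_a and tree_c (one label change each), while the branch leading to tree_b is pruned once its cut-off distance reaches 3; raising the threshold to 4 admits tree_b as well.
      </Paragraph>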
    </Section>
  </Section>
  <Section position="5" start_page="862" end_page="863" type="metho">
    <SectionTitle>
3 Experimental Results
</SectionTitle>
    <Paragraph position="0"> We have experimented with 3 synthetically generated sets of trees with the properties given in Table 1. In this table, the third column (labeled ALF) gives the average ratio of the vertices at each level which are randomly selected as leaf vertices in a tree. The fourth column gives the maximum number of children that a non-leaf node may have.</Paragraph>
    <Paragraph position="1"> The last column gives the maximum depth of the trees in the database.</Paragraph>
    <Paragraph position="2"> From these synthetic databases, we randomly extracted 100 trees and then perturbed them with random leaf deletions, insertions and label changes so that they were of some distance from a tree in the original database. We used thresholds t = 2 and t = 4, allowing an error of C = 1 for each leaf label change and an error of S = 2 for each insertion or deletion (see Section 2.2). We then ran our algorithm on these data sets and obtained performance information. All runs were performed on a Sun SPARCstation 20/61 with 128M real memory.</Paragraph>
    <Paragraph position="3"> The results are presented in Table 2.</Paragraph>
    <Paragraph position="4"> It can be seen that the approximate search algorithm is very fast for the set of synthetic tree databases that we have experimented with. It is also possible that additional space savings can be achieved if directed acyclic graphs are used to represent the tree database, taking into account both common prefixes and common suffixes of vertex list sequences.</Paragraph>
  </Section>
  <Section position="6" start_page="863" end_page="863" type="metho">
    <SectionTitle>
5 Acknowledgments
</SectionTitle>
    <Paragraph position="0"> This research was in part funded by a NATO Science for Stability Phase III Project Grant - TU-LANGUAGE.</Paragraph>
  </Section>
</Paper>