<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1057"> <Title>String Transformation Learning</Title> <Section position="4" start_page="0" end_page="444" type="metho"> <SectionTitle> 2 The learning paradigm </SectionTitle> <Paragraph position="0"> The learning paradigm we adopt is called error-driven learning and was originally proposed in (Brill, 1995) for part-of-speech tagging applications. We briefly introduce here the basic assumptions of the approach.</Paragraph> <Paragraph position="1"> A string transformation is a rewriting rule denoted as u → v, where u and v are strings such that |u| = |v|. This means that if u appears as a factor of some string w, then u should be replaced by v in w.</Paragraph> <Paragraph position="2"> The application of the transformation might be conditioned by the requirement that some additionally specified pattern matches some part of the string w to be rewritten.</Paragraph> <Paragraph position="3"> We now describe how transformations can be automatically learned. A pair of strings (w, w') is an aligned pair if |w| = |w'|. When w = u x suff_i(w), w' = u' x' suff_i(w') and |x| = |x'|, we say that factors x and x' occur at aligned positions within (w, w'). A multi-set of aligned pairs is called an aligned corpus. Let (w, w') be an aligned pair and let τ be some transformation of the form u → v.</Paragraph> <Paragraph position="4"> The positive evidence of τ (w.r.t. (w, w')) is the number of different positions at which factors u and v are aligned within (w, w'). The negative evidence of τ (w.r.t. (w, w')) is the number of different positions at which factors u and u are aligned within (w, w'). [Figure caption fragment: "i and ending at position j (hence [1, 2] denotes ac)."] Intuitively speaking, positive (negative) evidence is a count of how many times we will do well (badly, respectively) when using τ on w in trying to get w'. The score associated with τ is the difference between the positive evidence and the negative evidence of τ. 
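As a hedged illustration of the definitions above, the following Python sketch counts the positive and negative evidence of a plain transformation u → v on a single aligned pair by direct scanning. The function names `evidence` and `score` are hypothetical and do not come from the paper, and the scan is only a naive check of the definitions, not the suffix-tree method developed in later sections.

```python
def evidence(w, w_prime, u, v):
    """Return (positive, negative) evidence of u -> v w.r.t. the aligned pair.

    Positive evidence: positions where u occurs in w and v occurs at the
    same position in w_prime. Negative evidence: positions where u occurs
    in w and u also occurs at the same position in w_prime.
    """
    if len(w) != len(w_prime):
        raise ValueError("an aligned pair needs strings of equal length")
    if len(u) != len(v):
        raise ValueError("a transformation u -> v needs |u| == |v|")
    k = len(u)
    positive = 0
    negative = 0
    # Try every start position at which a factor of length k fits.
    for i in range(len(w) - k + 1):
        if w[i:i + k] == u:
            if w_prime[i:i + k] == v:
                positive += 1
            if w_prime[i:i + k] == u:
                negative += 1
    return positive, negative


def score(w, w_prime, u, v):
    """Score of u -> v on one aligned pair: positive minus negative evidence."""
    positive, negative = evidence(w, w_prime, u, v)
    return positive - negative
```

For instance, `score("abb", "acc", "b", "c")` evaluates to 2 under this sketch, since both occurrences of b in w are aligned with c in w' and none is aligned with b.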
The score extends to an aligned corpus in the obvious way. We are interested in the set of transformations that are associated with the highest score in a given aligned corpus, and will develop algorithms to find such a set in the next sections.</Paragraph> </Section> <Section position="5" start_page="444" end_page="446" type="metho"> <SectionTitle> 3 Data Structures </SectionTitle> <Paragraph position="0"> This section introduces two data structures that are basic to the development of the algorithms presented in this paper.</Paragraph> <Section position="1" start_page="444" end_page="445" type="sub_section"> <SectionTitle> 3.1 Suffix trees </SectionTitle> <Paragraph position="0"> We briefly present here a data structure that is well known in the text processing literature; the reader is referred to (Crochemore and Rytter, 1994) and (Apostolico, 1985) for definitions and further references.</Paragraph> <Paragraph position="1"> Let w be some non-null string. Throughout the paper we assume that the rightmost symbol of w is an end-marker not found at any other position in the string. The suffix tree associated with w is a &quot;compressed&quot; trie of all strings suff_i(w), 1 ≤ i ≤ |w|. Edges are labeled by factors of w, which are encoded by means of two natural numbers denoting endpoints in the string. An example is reported in Figure 1.</Paragraph> <Paragraph position="2"> An implicit node is a node not explicitly represented in the suffix tree, that splits the label of some edge at a given position. (Each implicit node corresponds to some node in the original trie having only one child.) We denote by parent(p) the parent node of (implicit) node p and by label(p, q) the label of the edge spanning (implicit) nodes p and q.</Paragraph> <Paragraph position="3"> Throughout the paper, we take the dominance relation between nodes to be reflexive, unless we write proper dominance. 
We also say that implicit node q immediately dominates node p if q splits the arc between parent(p) and p. Of main interest here are the following properties of suffix trees: * if node p has children p_1, ..., p_d, then d ≥ 2 and the strings label(p, p_i) differ from one another at the leftmost symbol; * all and only the factors of w are represented by paths from the root to some (implicit) node; * the statistic of factor u of w is the number of leaves dominated by the (implicit) node ending the path representing u.</Paragraph> <Paragraph position="4"> In the remainder of the paper, we sometimes identify an (implicit) node of a suffix tree with the factor represented by the path from the root to that node.</Paragraph> <Paragraph position="5"> The suffix tree and the statistics of all factors of w can be constructed/computed in time O(|w|), as reported in (Weiner, 1973) and (McCreight, 1976).</Paragraph> <Paragraph position="6"> McCreight's algorithm uses two basic functions to scan paths in the suffix tree under construction.</Paragraph> <Paragraph position="7"> These functions are briefly introduced here and will be exploited in the next subsection. Below, p is a node in a tree and u is a non-null string.</Paragraph> <Paragraph position="8"> function Slow_scan(p, u): Starting at p, scan u symbol by symbol. Return the (implicit) node corresponding to the last matching symbol.</Paragraph> <Paragraph position="9"> The next function runs faster than Slow_scan, and can be used whenever we already know that u is an (implicit) node in the tree (u completely matches some path in the tree).</Paragraph> <Paragraph position="10"> function Fast_scan(p, u): Starting at p, scan u by iteratively (i) finding the edge between the current node and one of its children that has the same first symbol as the suffix of u yet to be scanned, and (ii) skipping a prefix of u equal to the length of the selected edge label. 
Return the (implicit) node u.</Paragraph> <Paragraph position="12"> [Figure caption fragment] number; if the incident node is an implicit node, then we add between parentheses the relative position w.r.t.</Paragraph> <Paragraph position="13"> the arc label.</Paragraph> <Paragraph position="14"> From each node au in the suffix tree, with au some factor, McCreight's algorithm creates a pointer, called an s-link, to node u, which necessarily exists in the suffix tree. We write q = s-link(p) if there is an s-link from p to q.</Paragraph> </Section> <Section position="2" start_page="445" end_page="446" type="sub_section"> <SectionTitle> 3.2 Suffix tree alignment </SectionTitle> <Paragraph position="0"> In the next section each transformation will be associated with several strings. Given an input text, we will compute transformation scores by computing statistics of these strings. This can easily be done using suffix trees, and by pairing statistics corresponding to the same transformation. The latter task can be done using the data structure originally introduced here.</Paragraph> <Paragraph position="1"> A total function h : Σ → Σ', with Σ and Σ' two alphabets, is called a (restricted) homomorphism. We extend h to a string function in the usual way by posing h(ε) = ε and h(au) = h(a)h(u), a ∈ Σ and u ∈ Σ*. Given w ∈ Σ+ and w' ∈ Σ'+, we need to pair each factor u of w with factor h(u) possibly occurring in w'. To solve this problem, we construct the suffix trees T, T' for w, w', respectively. Then we establish an a-link (a pointer) from each node u of T, u some factor, to the (implicit) node h(u) of T', if h(u) exists. Furthermore, if factor ua with a ∈ Σ is an (implicit) node of T such that h(u) but not h(ua) is an (implicit) node of T', we create node u in T (if u was an implicit node) and establish an a-link from u to (implicit) node h(u) of T'. Note that the total number of a-links is O(|w|). The resulting data structure is called here suffix tree alignment. 
An example is reported in Figure 2.</Paragraph> <Paragraph position="2"> We now specify a method to compute suffix tree alignments. In what follows p, p' are tree nodes and u is a non-null string. Crucially, we assume we can access the s-links of T and T'. Paths u and v in T and T', respectively, are aligned if v = h(u). The next two functions are used to move a-links up and down two aligned paths.</Paragraph> <Paragraph position="3"> function Move_link_down(p, p', u): Starting at p and p', simultaneously scan u and h(u), respectively, using function Slow_scan. Stop as soon as a symbol is not matched. At each encountered node of T and at the (implicit) node of T corresponding to the last successful match, create an a-link to the paired (implicit) node of T'. Return the pair of nodes in the last created a-link along with the length of the successfully matched prefix of u.</Paragraph> <Paragraph position="4"> In the next function, we use function Fast_scan introduced in Section 3.1, but we run it up the tree (with the obvious modifications).</Paragraph> <Paragraph position="5"> function Move_link_up(p, p'): Starting at p and p', simultaneously scan the paths to the roots of T and T', respectively, using function Fast_scan. Stop as soon as a node of T is encountered that already has an a-link. At each encountered node of T create an a-link to the paired (implicit) node of T'.</Paragraph> <Paragraph position="6"> We also need a function that &quot;shifts&quot; a-links to a new pair of aligned paths. This is done using s-links. The next auxiliary function takes care of those (implicit) nodes for which the s-link is missing. (This is the case for implicit nodes of T' and for some nodes of T that have been newly created.) We rely on the property that the parent node of any such (implicit) node always has an s-link, when it differs from the root.</Paragraph> <Paragraph position="7"> function Up_link_down(p): If s-link(p) is defined then return s-link(p). 
Else, let p_1 = parent(p). If p_1 is not the root node, let p_2 = s-link(p_1) and return (implicit) node Fast_scan(p_2, label(p_1, p)).</Paragraph> <Paragraph position="8"> If p_1 is the root node, return (implicit) node</Paragraph> <Paragraph position="10"> We can now present the algorithm for the construction of suffix tree alignments.</Paragraph> <Paragraph position="11"> Algorithm 1 Let T and T' be the suffix trees for strings w and w', respectively: (b_{|w|}, b'_{|w|}, d) ← Move_link_down(root of T, root of T', [Figure 3 caption fragment: and the established a-links at each iteration of Algorithm 1, when constructing the suffix tree alignment in Figure 2. To denote a-links we use the same integer numbers as in Figure 2.]</Paragraph> <Paragraph position="12"> for i from |w| - 1 downto 1 do begin</Paragraph> <Paragraph position="14"> In Figure 3 a sample run of Algorithm 1 is schematically represented.</Paragraph> <Paragraph position="15"> In the next section we use the following properties of Algorithm 1: * after T and T' have been processed, for every node p of T representing factor u of w, (implicit) node a-link(p) of T' is defined if and only if a-link(p) represents factor h(u) of w'; * the algorithm can be executed in time O(|w| + |w'|).</Paragraph> <Paragraph position="16"> The first property above can be proved as follows. For 1 ≤ i ≤ |w|, b_i in Algorithm 1 is (the node representing) the longest prefix of suff_i(w) such that h(b_i) is an (implicit) node of T' (is a factor of w'). This can be proved by induction on |w| - i, using the definition of Move_link_down and of s-link. We then observe that, if u is a node of T, then factor u is a prefix of some suff_i(w) and either u dominates b_i or b_i properly dominates u in T. If u dominates b_i, then h(u) must be an (implicit) node of T'. In this case an a-link is established from u to h(u) by Move_link_up or Move_link_down, depending on whether u dominates or is dominated by sb_i in T. If b_i properly dominates u, h(u) does not occur in w'. 
In this case, node u is never reached by the algorithm and no a-link is established for this node.</Paragraph> <Paragraph position="17"> The proof of the linear time result is rather long; we only give an outline here. The interesting case is the function Shift_link, which is executed |w| - 1 times by the algorithm. When executed once on nodes p and p', Shift_link uses time O(1) if s-link(p) and s-link(p') are both defined. In all other cases, it uses an amount of time proportional to the number of (implicit) nodes visited by function Fast_scan, which is called through function Up_link_down. We use an amortization technique and charge a constant amount of time to the symbols in w and w', for each node visited in this way. Consider the execution of Shift_link(b_{i+1}, b'_{i+1}) for some i, 1 ≤ i ≤ |w| - 1. Assume that, correspondingly, Fast_scan visits nodes u_1, ..., u_d of T in this order, with d ≥ 1 and each u_j some factor of w. Then we have that each u_j is a (proper) prefix of u_{j+1}, and u_d = sb_i. For each u_j, 1 ≤ j ≤ d - 1, we charge a constant amount of time to the symbol in w &quot;corresponding&quot; to the last symbol of u_j. The visit to u_d, on the other hand, is charged to the ith symbol of w. (Note that charging the visit to u_d to the symbol in w &quot;corresponding&quot; to the last symbol of u_d does not work, since in the case of sb_i = b_i the same symbol would be charged again at the next iteration of the for-cycle.) It is not difficult to see that, in this way, each symbol of w is charged at most once. A similar argument works for visits to nodes of T' by Fast_scan, which are charged to symbols of w'. This shows that the time used by all executions of Shift_link is O(|w| + |w'|).</Paragraph> <Paragraph position="18"> Suffix trees and suffix tree alignments can be generalized to finite multi-sets of strings, each string ending with the same end-marker not found at any other position. 
In this case each leaf holds a record, called count, of the number of times the corresponding suffix appears in the entire multi-set, which will be propagated appropriately when computing factor statistics. Most important here, all of the above results still hold for these generalizations. In the next section, we will deal with the multi-set case.</Paragraph> </Section> </Section> <Section position="6" start_page="446" end_page="449" type="metho"> <SectionTitle> 4 Transformation learning </SectionTitle> <Paragraph position="0"> This section deals with the computational problem of learning string transformations from an aligned corpus. We show that some families of transformations can be efficiently learned by exploiting the data structures of Section 3. We also consider more general kinds of transformations and show that for this class the learning problem is NP-hard.</Paragraph> <Section position="1" start_page="446" end_page="447" type="sub_section"> <SectionTitle> 4.1 Data representation </SectionTitle> <Paragraph position="0"> We introduce a representation of aligned corpora that reduces the problem of computing the positive/negative evidence of transformations to the problem of computing factor statistics.</Paragraph> <Paragraph position="1"> Let (w, w') be an aligned pair, w = a_1 ... a_n and w' = a'_1 ... a'_n, with a_i, a'_i ∈ Σ for 1 ≤ i ≤ n, and n ≥ 1.</Paragraph> <Paragraph position="2"> We define w × w' = (a_1, a'_1)(a_2, a'_2) ... (a_n, a'_n). (1)</Paragraph> <Paragraph position="4"> Note that w × w' is a string over the new alphabet Σ × Σ. Let N ≥ 1 and let L = {(w_1, w'_1), ..., (w_N, w'_N)} be an aligned corpus. 
We represent L as a string multi-set over alphabet Σ × Σ: L_× = {w × w' | (w, w') ∈ L}, (2)</Paragraph> <Paragraph position="6"> where w × w' appears in L_× as many times as (w, w') appears in L.</Paragraph> </Section> <Section position="2" start_page="447" end_page="449" type="sub_section"> <SectionTitle> 4.2 Learning algorithms </SectionTitle> <Paragraph position="0"> Let L be an aligned corpus with N aligned pairs over a fixed alphabet Σ, and let n be the length of the longest string in a pair in L. We start by considering plain transformations of the form u → v, (3)</Paragraph> <Paragraph position="2"> where u, v ∈ Σ+, |u| = |v|. We want to find all instances of strings u, v ∈ Σ* such that, in L, u → v has score greater than or equal to the score of any other transformation. Existing methods for this problem are data-driven. They consider all pairs of factors (with lengths bounded by n) occurring at aligned positions within some pair in L, and update the positive and the negative evidence of the associated transformations. They thus consider O(Nn^2) factor pairs, where each pair takes time O(n) to be read/stored. We conclude that these methods use an amount of time O(Nn^3). We can improve on this by using suffix tree alignments.</Paragraph> <Paragraph position="3"> Let L_× be defined as in (2) and let h_1 : (Σ × Σ) → (Σ × Σ) be the homomorphism specified as h_1((a, b)) = (a, a).</Paragraph> <Paragraph position="5"> Recall that each suffix of a multi-set of strings is represented by a leaf in the associated suffix tree, because of the use of the end-marker, and that each leaf stores the count of the occurrences of the corresponding suffix in the source multi-set. 
We schematically specify our first learning algorithm below.</Paragraph> <Paragraph position="6"> Algorithm 2 Step 1: construct two copies T_× and T'_× of the suffix tree associated with L_× and align them using h_1; Step 2: visit trees T_× and T'_× in post-order, and annotate each node p with the number e(p) computed as the sum of the counts at leaves that p dominates; Step 3: annotate each node p of T_× with the score e(p) - e(p'), where p' = a-link(p) if a-link(p) is an actual node, p' is the node immediately dominated by a-link(p) if a-link(p) is an implicit node, and e(p') = 0 if a-link(p) is undefined; make a list of the nodes with the highest annotated score.</Paragraph> <Paragraph position="7"> Let p be a node of T_× associated with factor u × v.</Paragraph> <Paragraph position="8"> Integer e(p) computed at Step 2 is the number of times a suffix having u × v as a prefix appears in strings in L_×. Thus e(p) is the number of different positions at which factors u and v are aligned within L_× and hence the positive evidence of transformation u → v w.r.t. L, as defined in Section 2. Similarly, e(p') is the statistic of factor u × u and hence the negative evidence of u → v (as well as the negative evidence of all transformations having u as left-hand side). It follows that Algorithm 2 records, at Step 3, the transformations having the highest score in L among all transformations represented by nodes of T_×. It is not difficult to see that the remaining transformations, denoted by implicit nodes of T_×, do not have score greater than the one above.</Paragraph> <Paragraph position="9"> The latter transformations with highest score, if any, can be easily recovered by visiting the implicit nodes that immediately dominate the nodes of T_× recorded at Step 3.</Paragraph> <Paragraph position="10"> A complexity analysis of Algorithm 2 is straightforward. Step 1 can be executed in time O(Nn), as discussed in Section 3. 
Since the size of T_× and T'_× is O(Nn), all other steps can be easily executed in linear time. Hence Algorithm 2 runs in time O(Nn).</Paragraph> <Paragraph position="11"> We now turn to a more general kind of transformation. In several natural language processing applications it is useful to generalize over some transformations of the form in (3), by using classes of symbols in Σ. Let t ≥ 1 and let C_1, ..., C_t be a partition of Σ (each C_i ≠ ∅). Consider Γ = {C_1, ..., C_t} as an alphabet. We say that string a_1 ... a_d ∈ Σ+ matches string C_{i_1} ... C_{i_d} ∈ Γ+ if a_k ∈ C_{i_k} for 1 ≤ k ≤ d. We consider transformations of the form uγ → v −, (4)</Paragraph> <Paragraph position="13"> where u, v ∈ Σ+, |u| = |v|, γ ∈ Γ+, and assume the following interpretation. An occurrence of string u must be rewritten to v in a text whenever u is followed by a substring matching γ. String γ is called the right context of the transformation. The positive evidence for such a transformation is the number of positions at which factors ux and vx' are aligned within the corpus, for all possible x, x' ∈ Σ+ with x matching γ. (We do not require x = x', since later transformations can change the right context.) The negative evidence for the transformation is the number of positions at which factors ux and ux' are aligned within the corpus, x, x' as above.</Paragraph> <Paragraph position="14"> We are not aware of any learning method for transformations of the form in (4). A naive method for this task would consider all factor pairs appearing at aligned positions in some pair in L. The left component of each factor must then be split into a string in Σ+ and a string in Γ+, to represent a transformation in the desired form. Overall, there are O(Nn^3) possible transformations, and we need time O(n) to read/store each transformation. Then the method uses an amount of time O(Nn^4). Again, we can improve on this. We need a representation for right context strings. Define homomorphism h_2 : (Σ × Σ) → Γ as h_2((a, b)) = C, where C is the class in Γ with a ∈ C.</Paragraph> <Paragraph position="16"> [Footnote fragment:] v / _ γ. 
Our notation can more easily be generalized, as is needed in some transformation systems.</Paragraph> <Paragraph position="17"> (h_2 is well defined since Γ is a partition of Σ.) Let L_Γ = {h_2(w × w') | w × w' ∈ L_×},</Paragraph> <Paragraph position="19"> where h_2(w × w') appears in L_Γ as many times as w × w' appears in L_×.</Paragraph> <Paragraph position="20"> [Figure 4 caption fragment:] realized, where dashed arrows denote a-links, black circles denote nodes, and white circles denote nodes that might be implicit. Integer e > 0 is a count of the paths from node q downward, having the form y × y' with a prefix of y matching γ. Similarly, e' is a count of the paths from node q' downward satisfying the same matching condition with γ. The matching condition is enforced by the fact that the above paths have their ending leaf nodes a-linked to a leaf node of T_Γ dominated by node p.</Paragraph> <Paragraph position="21"> Below we link a suffix tree to more than one suffix tree. In the notation of a-links we then use a subscript indicating the suffix tree of the target node, in order to distinguish among different linkings. 
We now schematically specify the learning algorithm; additional computational details will be provided later in the discussion of the complexity.</Paragraph> <Paragraph position="22"> Algorithm 3 Step 1: construct two copies T_× and T'_× of the suffix tree associated with L_× and construct the suffix tree T_Γ associated with L_Γ; Step 2: align T_× with T'_× using h_1 and align the resulting suffix trees T_× and T'_× with T_Γ using h_2; Step 3: for each node p of T_Γ, store a set τ(p) including all triples (q, e, e') such that (see Figure 4): * q is a node of T_× such that a-link_{T_Γ}(q) properly dominates p * e > 0 is the sum of the counts at leaves of T_× dominated by q that have an a-link to a leaf of T_Γ dominated by p * if q' = a-link_{T'_×}(q) is defined, e' is the sum of the counts at leaves of T'_× dominated by q' that have an a-link to a leaf of T_Γ dominated by p; otherwise, e' = 0; Step 4: find all pairs (p, q), p a node of T_Γ and (q, e, e') ∈ τ(p), such that e - e' is greater than or equal to any other e_1 - e'_1, (q_1, e_1, e'_1) in some τ(p_1). We next show that if pair (p, q) is found at Step 4, then q represents a factor u × v, p represents a factor h_2(u × v)γ, and transformation uγ → v − has the highest score among all transformations represented by nodes of T_× and T_Γ. Similarly to the case of Algorithm 2, this is the highest score achieved in L, and other transformations with the same score can be obtained from some of the implicit nodes immediately dominating p and q.</Paragraph> <Paragraph position="23"> Let p and q be defined as in Step 3 above. Assume that q represents a factor u × v of some string in L_× and p represents a factor δγ ∈ Γ* of some string in L_Γ, where |δ| = |u|. Since a-link_{T_Γ}(q) dominates p, we must have h_2(u × v) = δ. Consider a suffix (u × v)(x × x')(y × y') appearing in ℓ > 0 strings in L_×, such that h_2(x × x') = γ. 
(This means that x matches γ, and there are at least ℓ positions at which u → v has been applied with a right context of γ.) We have that string h_2((u × v)(x × x')(y × y')) = δγh_2(y × y') must be a suffix of some strings in L_Γ. It follows that (u × v)(x × x')(y × y') is a leaf of T_× with a count of ℓ, δγh_2(y × y') is a leaf of T_Γ, and there is an a-link between these two nodes. Leaf (u × v)(x × x')(y × y') is dominated by q, and leaf δγh_2(y × y') is dominated by p. Then, at Step 3, integer ℓ is added to e. Since no condition has been imposed above on string x' and on suffix (y × y'), we conclude that the final value of e must be the positive evidence of transformation uγ → v −. A similar argument shows that the negative evidence of this transformation is stored in e'. It then follows that, at Step 4, Algorithm 3 finds the transformations with the highest score among those represented by nodes of T_× and T_Γ.</Paragraph> <Paragraph position="24"> Algorithm 3 can be executed in time O(Nn^2). We only outline a proof of this property here, by focusing on Step 3. To execute this step we visit T_Γ in post-order. At leaf node p, we consider the set F(p) of all leaves q of T_× such that p = a-link_{T_Γ}(q), and the set F'(p) of all leaves q' of T'_× such that p = a-link_{T_Γ}(q'). For each (implicit) node of T'_× that dominates some node in F'(p) and that is the target of some a-link (from some source node of T_×), we record the sum of the counts of the dominated nodes in F'(p). This can be done in time O(|F'(p)| n). For each node q of T_× dominating some node in F(p), we store in τ(p) the triple (q, e, e'), since a-link_{T_Γ}(q) necessarily dominates p. We let e > 0 be the sum of the counts of the dominated nodes in F(p), and let e' be the value retrieved from the a-link to T'_×, if any. This takes time O(|F(p)| n). When p ranges over the leaves of T_Γ, we have Σ_p |F(p)| = Σ_p |F'(p)| = O(Nn). We then conclude that sets τ(p) for all leaves p of T_Γ can be computed in time O(Nn^2). 
At internal node p with children p_i, 1 ≤ i ≤ d, d > 1, we assume that the sets τ(p_i) have already been computed. Assume that for some i we have (q, e_i, e'_i) ∈ τ(p_i) and a-link_{T_Γ}(q) does not immediately dominate p_i. If (q, e, e') ∈ τ(p), we add e_i, e'_i to e, e', respectively; otherwise, we insert (q, e_i, e'_i) in τ(p). We can then compute sets τ(p) for all internal nodes p of T_Γ using an amount of time Σ_p |τ(p)| = O(Nn^2).</Paragraph> </Section> <Section position="3" start_page="449" end_page="449" type="sub_section"> <SectionTitle> 4.3 General transformations </SectionTitle> <Paragraph position="0"> We have mentioned that the introduction of classes of alphabet symbols allows abstraction over plain transformations that is of interest to natural language applications. We generalize here transformations in (4) by letting γ be a string over Σ ∪ Γ. More precisely, we assume γ has the form: γ = u_0 γ_1 u_1 γ_2 ... u_{d-1} γ_d u_d, (5)</Paragraph> <Paragraph position="2"> where u_0, u_d ∈ Σ*, u_i ∈ Σ+ and γ_j ∈ Γ+ for 1 ≤ i ≤ d - 1 and 1 ≤ j ≤ d, and d ≥ 1. The notion of matching previously defined is now extended in such a way that, for a, b ∈ Σ, a matches b if a = b. Then the interpretation of the resulting transformation is the usual one. The parameter d in (5) is called the number of alternations of the transformation. We have established the following results: * transformations with a bounded number of alternations can be learned in polynomial time; * learning transformations with an unbounded number of alternations is NP-hard.</Paragraph> <Paragraph position="3"> Again, we only give an outline of the proof below.</Paragraph> <Paragraph position="4"> The first result is easy to show, by observing that in an aligned corpus there are polynomially many occurrences of transformations with a bounded number of alternations. The second result holds even if we restrict ourselves to |Σ| = 2 and |Γ| = 1, that is, if we use a don't care symbol. 
Here we introduce a decision problem associated with the optimization problem of learning the transformations with the highest score, and outline an NP-completeness proof.</Paragraph> </Section> </Section> <Section position="7" start_page="449" end_page="449" type="metho"> <SectionTitle> TRANSFORMATION SCORING (TS) </SectionTitle> <Paragraph position="0"> Instance: (L, K), with L an aligned corpus, K a positive integer.</Paragraph> <Paragraph position="1"> Question: Is there a transformation that has score greater than or equal to K w.r.t. L? Membership in NP is easy to establish for TS. To show NP-hardness, we consider the CLIQUE decision problem for undirected, simple, connected graphs and transform such a problem to the TS problem. (NP-completeness of this restriction of the CLIQUE problem (Garey and Johnson, 1979) is easy to establish.) Let (G, K') be an instance of the CLIQUE problem as above, G = (V, E) and K' > 0. Without loss of generality, we assume that V = {1, 2, ..., q}. Let Σ = {a, b}; we construct an instance of the TS problem (L, K) over Σ as follows. For each {i, j} ⊆ V with i &lt; j let w_{i,j} = a^{i-1} b a^{j-i-1} b a^{q-j}. (6) We add to the aligned corpus L: 1. one instance of pair p_{i,j} = (a w_{i,j}, b w_{i,j}) for each i &lt; j, {i, j} ∈ E; 2. q^2 instances of pair p_{i,j} = (a w_{i,j}, a w_{i,j}) for each i, j ∈ V with i &lt; j and {i, j} ∉ E; 3. q^2 instances of pair p_a = (a a^q, b a^q).</Paragraph> <Paragraph position="2"> Also, we set K = q^2 + (K' choose 2). The above instance of TS can easily be constructed in deterministic polynomial time with respect to the length of (G, K'). It is easy to show that when (G, K') is a positive instance of the source problem, then the corresponding instance of TS is satisfied by at least one transformation. Assume now that there exists a transformation τ having score greater than or equal to K > 0, w.r.t. L. Since the replacement of a with b is the only rewriting that appears in pairs of L, τ must have the form aγ → b −. 
If γ includes some occurrence of b, then τ cannot match p_a and the positive evidence of τ will not exceed |E| ≤ (q choose 2) &lt; K, contrary to our assumption. We then conclude that γ has the form (? denotes the don't care symbol): a^{j_1 - 1} ? a^{j_2 - j_1 - 1} ? ... ? a^{q' - j_d}, where V'' = {j_1, ..., j_d} ⊆ V, d ≥ 0 and q' ≤ q. If there exist i, j ∈ V'' such that {i, j} ∉ E, then τ would match some pair p_{i,j} ∈ L of the second kind above and it would have negative evidence greater than or equal to q^2. Since the positive evidence of τ cannot exceed q^2 + |E|, τ would have a score not exceeding |E| ≤ (q choose 2) &lt; K, contrary to our assumption. Then τ matches no such pair p_{i,j} ∈ L and, for each i, j ∈ V'', we have {i, j} ∈ E. Since K - q^2 = (K' choose 2), at least (K' choose 2) pairs p_{i,j} ∈ L are matched by τ. We therefore conclude that d ≥ K' and that V'' is a clique in G of size greater than or equal to K'. This concludes our outline of the proof.</Paragraph> </Section> <Section position="8" start_page="449" end_page="450" type="metho"> <SectionTitle> 5 Concluding remarks </SectionTitle> <Paragraph position="0"> With some minor technical changes to function Up_link_down, we can align a suffix tree with itself (w.r.t. a given homomorphism). In this way we improve the space performance of Algorithms 2 and 3, avoiding the construction of two copies of the same suffix tree. Algorithm 3 can trivially be adapted to learn transformations in (4) where a left context is specified in place of a right context. The algorithm can also be used to learn traditional phonological rules of the form a → b / _ γ, where a, b are single phonemes and γ is a sequence over {C, V}, the classes of consonants and vowels. In this case the algorithm runs in time O(Nn) (for fixed alphabet).</Paragraph> <Paragraph position="1"> We leave it as an open problem whether rules of the form in (4) can be learned in linear time.</Paragraph> <Paragraph position="2"> We have been concerned with learning the best transformations that should be applied at a given step. 
An ordered sequence of transformations can be learned by iteratively learning a single transformation and by processing the aligned corpus with the transformation just learned (Brill, 1995). Dynamic techniques for processing the aligned corpus were first proposed in (Ramshaw and Marcus, 1996) to re-edit the corpus only where needed. Those authors report that this is not space efficient if transformation learning is done by independently testing all possible transformations in the search space (as in (Brill, 1995)). The suffix tree alignment data structure allows simultaneous scoring for all transformations. We can now take advantage of this and design dynamic algorithms that re-edit a suffix tree alignment only where needed, along the lines of a similar method for suffix trees in (McCreight, 1976).</Paragraph> <Paragraph position="3"> An alternative data structure to suffix trees for the representation of string factors, called DAWG, has been presented in (Blumer et al., 1985). We point out here that, because a DAWG is an acyclic graph rather than a tree, straightforward ways of defining alignment between two DAWGs result in a quadratic number of a-links, making DAWGs much less attractive than suffix trees for factor alignment. 
We believe that suffix tree alignments are a very flexible data structure, and that other transformations could be efficiently learned using these structures.</Paragraph> <Paragraph position="4"> We do not regard the result in Section 4.3 as a negative one, since general transformations specified as in (5) seem too powerful for the proposed applications in natural language processing, and learning might result in corpus overtraining.</Paragraph> <Paragraph position="5"> Beyond transformation-based systems, the methods presented in this paper can be used for learning rules of constraint grammars (Karlsson et al., 1995), phonological rule systems as in (Kaplan and Kay, 1994), and in general those grammatical systems using constraints represented by means of rewriting rules. This is the case whenever we can encode the alphabet of the corpus in such a way that alignment is possible.</Paragraph> </Section> </Paper>