<?xml version="1.0" standalone="yes"?> <Paper uid="J95-2004"> <Title>Deterministic Part-of-Speech Tagging with Finite-State Transducers</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> Finite-state devices have important applications to many areas of computer science, including pattern matching, databases, and compiler technology. Although their linguistic adequacy to natural language processing has been questioned in the past (Chomsky, 1964), there has recently been a dramatic renewal of interest in the application of finite-state devices to several aspects of natural language processing. This renewal of interest is due to the speed and compactness of finite-state representations. This efficiency is explained by two properties: finite-state devices can be made deterministic, and they can be turned into a minimal form. Such representations have been successfully applied to different aspects of natural language processing, such as morphological analysis and generation (Karttunen, Kaplan, and Zaenen 1992; Clemenceau 1993), parsing (Roche 1993; Tapanainen and Voutilainen 1993), phonology (Laporte 1993; Kaplan and Kay 1994) and speech recognition (Pereira, Riley, and Sproat 1994). Although finite-state machines have been used for part-of-speech tagging (Tapanainen and Voutilainen 1993; Silberztein 1993), none of these approaches has the same flexibility as stochastic techniques. Unlike stochastic approaches to part-of-speech tagging (Church 1988; Kupiec 1992; Cutting et al. 1992; Merialdo 1990; DeRose 1988; Weischedel et al. 
1993), up to now the knowledge found in finite-state taggers has been handcrafted and was not automatically acquired.</Paragraph> <Paragraph position="1"> Recently, Brill (1992) described a rule-based tagger that performs as well as taggers based upon probabilistic models and overcomes the limitations common in rule-based approaches to language processing: it is robust and the rules are automatically acquired. * Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02139. E-mail: roche/schabes@merl.com.</Paragraph> <Paragraph position="2"> © 1995 Association for Computational Linguistics. Computational Linguistics, Volume 21, Number 2. In addition, the tagger requires drastically less space than stochastic taggers. However, current implementations of Brill's tagger are considerably slower than the ones based on probabilistic models, since it may require RKn elementary steps to tag an input of n words with R rules requiring at most K tokens of context.</Paragraph> <Paragraph position="3"> Although the speed of current part-of-speech taggers is acceptable for interactive systems where one sentence at a time is processed, it is not adequate for applications where large bodies of text need to be tagged, such as information retrieval, indexing applications, and grammar-checking systems. Furthermore, the space required by part-of-speech taggers is also an issue in commercial personal computer applications such as grammar-checking systems. In addition, part-of-speech taggers are often coupled with a syntactic analysis module.
Usually these two modules are written in different frameworks, making it very difficult to integrate interactions between the two modules.</Paragraph> <Paragraph position="4"> In this paper, we design a tagger that requires n steps to tag a sentence of length n, independently of the number of rules and the length of the context they require.</Paragraph> <Paragraph position="5"> The tagger is represented by a finite-state transducer, a framework that can also be the basis for syntactic analysis. This finite-state tagger will also be found useful when combined with other language components, since it can be naturally extended by composing it with finite-state transducers that could encode other aspects of natural language syntax.</Paragraph> <Paragraph position="6"> Relying on algorithms and formal characterizations described in later sections, we explain how each rule in Brill's tagger can be viewed as a nondeterministic finite-state transducer. We also show how the application of all rules in Brill's tagger is achieved by composing each of these nondeterministic transducers and why nondeterminism arises in this transducer. We then prove the correctness of the general algorithm for determinizing (whenever possible) finite-state transducers, and we successfully apply this algorithm to the previously obtained nondeterministic transducer. The resulting deterministic transducer yields a part-of-speech tagger that operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in this deterministic finite-state machine. We also show how the lexicon used by the tagger can be optimally encoded using a finite-state machine.</Paragraph> <Paragraph position="7"> The techniques used for the construction of the finite-state tagger are then formalized and mathematically proven correct. 
We introduce a proof of soundness and completeness with a worst-case complexity analysis for the algorithm for determinizing finite-state transducers.</Paragraph> <Paragraph position="8"> We conclude by proving that the method can be applied to the class of transformation-based error-driven systems.</Paragraph> </Section> <Section position="3" start_page="0" end_page="229" type="metho"> <SectionTitle> 2. Overview of Brill's Tagger </SectionTitle> <Paragraph position="0"> Brill's tagger comprises three parts, each of which is inferred from a training corpus: a lexical tagger, an unknown word tagger, and a contextual tagger. For purposes of exposition, we will postpone the discussion of the unknown word tagger and focus mainly on the contextual rule tagger, which is the core of the tagger.</Paragraph> <Paragraph position="1"> The lexical tagger initially tags each word with its most likely tag, estimated by examining a large tagged corpus, without regard to context. For example, assuming that vbn is the most likely tag for the word &quot;killed&quot; and vbd for &quot;shot,&quot; the lexical tagger might assign the following part-of-speech tags: 1 Since the lexical tagger does not use any contextual information, many words can be tagged incorrectly. For example, in (1), the word &quot;killed&quot; is erroneously tagged as a verb in past participle form, and in (2), &quot;shot&quot; is incorrectly tagged as a verb in past tense.</Paragraph> <Paragraph position="2"> Given the initial tagging obtained by the lexical tagger, the contextual tagger applies a sequence of rules in order and attempts to remedy the errors made by the initial tagging. For example, the rules in Figure 1 might be found in a contextual tagger.</Paragraph> <Paragraph position="3"> The first rule says to change tag vbn to vbd if the previous tag is np. The second rule says to change tag vbd to vbn if the next tag is by.
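The behavior of the two rules in Figure 1 can be sketched in a few lines of Python. This is an illustrative sketch of ours, not the authors' implementation; the helper names (`prev_tag`, `next_tag`) and the sample tag sequences are assumptions standing in for the paper's numbered examples.

```python
# Hedged sketch of Brill-style contextual rules (rules of Figure 1).

def prev_tag(old, new, prev):
    """change old to new if the previous tag is prev"""
    def rule(tags):
        return [new if t == old and i > 0 and tags[i - 1] == prev else t
                for i, t in enumerate(tags)]
    return rule

def next_tag(old, new, nxt):
    """change old to new if the next tag is nxt"""
    def rule(tags):
        return [new if t == old and i + 1 < len(tags) and tags[i + 1] == nxt else t
                for i, t in enumerate(tags)]
    return rule

rules = [prev_tag("vbn", "vbd", "np"),   # vbn vbd PREVTAG np
         next_tag("vbd", "vbn", "by")]   # vbd vbn NEXTTAG by

# Assumed initial tagging: a proper noun followed by "killed" (cf. example (1)).
tags = ["np", "vbn", "np", "np"]
for rule in rules:
    tags = rule(tags)
print(tags)   # ['np', 'vbd', 'np', 'np']

# Assumed passive context (cf. example (3)): the two rules cancel out.
tags2 = ["np", "vbn", "by"]
for rule in rules:
    tags2 = rule(tags2)
print(tags2)  # ['np', 'vbn', 'by'] -- no net change
```

The second run illustrates the interaction discussed later: the first rule changes vbn to vbd, and the second rule immediately undoes it because the next tag is by.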
Once the first rule is applied, the tag for &quot;killed&quot; in (1) and (3) is changed from vbn to vbd and the following tagged sentences are obtained: It is relevant to the discussion that follows to note that the application of the NEXT-TAG rule must look ahead one token in the sentence before it can be applied, and that the application of two rules may perform a series of operations resulting in no net change. As we will see in the next section, these two aspects are the source of local nondeterminism in Brill's tagger.</Paragraph> <Paragraph position="4"> The sequence of contextual rules is automatically inferred from a training corpus. A list of tagging errors (with their counts) is compiled by comparing the output of the lexical tagger to the correct part-of-speech assignment. Then, for each error, it is determined which instantiation of a set of rule templates results in the greatest error reduction. Then the set of new errors caused by applying the rule is computed, and the process is repeated until the error reduction drops below a given threshold.</Paragraph> <Paragraph position="5"> Kučera 1982): pps stands for singular nominative pronoun in third person, vbd for verb in past tense, np for proper noun, vbn for verb in past participle form, by for the word &quot;by,&quot; at for determiner, nn for singular noun, and bedz for the word &quot;was.&quot; Change A to B if previous tag is C; change A to B if previous one or two or three tag is C; change A to B if previous one or two tag is C; change A to B if next one or two tag is C; change A to B if next tag is C; change A to B if surrounding tags are C and D; change A to B if next bigram tag is C D; change A to B if previous bigram tag is C D. Figure 2 Contextual rule templates.</Paragraph> <Paragraph position="6"> [figure residue; content not recoverable]</Paragraph> <Paragraph position="8"> Using the set of contextual rule templates shown in Figure 2, after training on the Brown Corpus, 280 contextual rules are obtained.
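The error-driven training loop just described can be sketched as follows. This is a simplified toy of ours, not Brill's trainer: it instantiates a single PREVTAG template over a tiny tag set and greedily keeps the instantiation with the greatest error reduction until the gain drops below a threshold.

```python
# Hedged sketch of transformation-based, error-driven learning.
# Only one template ("change A to B if previous tag is C") is instantiated.

def errors(tags, gold):
    return sum(1 for t, g in zip(tags, gold) if t != g)

def apply_rule(rule, tags):
    old, new, prev = rule
    return [new if t == old and i > 0 and tags[i - 1] == prev else t
            for i, t in enumerate(tags)]

def learn(tags, gold, tagset, threshold=1):
    learned = []
    while True:
        candidates = [(a, b, c) for a in tagset for b in tagset
                      for c in tagset if a != b]
        best = min(candidates, key=lambda r: errors(apply_rule(r, tags), gold))
        gain = errors(tags, gold) - errors(apply_rule(best, tags), gold)
        if gain < threshold:
            return learned
        learned.append(best)
        tags = apply_rule(best, tags)

gold = ["np", "vbd", "np", "np"]          # assumed correct tagging
initial = ["np", "vbn", "np", "np"]       # assumed lexical-tagger output
print(learn(initial, gold, {"np", "vbn", "vbd"}))
# [('vbn', 'vbd', 'np')] -- i.e. the rule "vbn vbd PREVTAG np"
```

The real trainer compiles error counts over a large corpus and instantiates all the templates of Figure 2, but the greedy select-apply-repeat structure is the same.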
The resulting rule-based tagger performs as well as state-of-the-art taggers based upon probabilistic models. It also overcomes the limitations common in rule-based approaches to language processing: it is robust, and the rules are automatically acquired. In addition, the tagger requires drastically less space than stochastic taggers. However, as we will see in the next section, Brill's tagger is inherently slow.</Paragraph> </Section> <Section position="4" start_page="229" end_page="233" type="metho"> <SectionTitle> 3. Complexity of Brill's Tagger </SectionTitle> <Paragraph position="0"> Once the lexical assignment is performed, in Brill's algorithm, each contextual rule acquired during the training phase is applied to each sentence to be tagged. For each individual rule, the algorithm scans the input from left to right while attempting to match the rule.</Paragraph> <Paragraph position="1"> This simple algorithm is computationally inefficient for two reasons. The first reason for inefficiency is the fact that an individual rule is compared at each token of the input, regardless of the fact that some of the current tokens may have been previously examined when matching the same rule at a previous position. The algorithm treats each rule as a template of tags and slides it along the input, one word at a time.</Paragraph> <Paragraph position="2"> Consider, for example, the rule A B PREVBIGRAM C C that changes tag A to tag B if the previous two tags are C.</Paragraph> <Paragraph position="3"> When applied to the input CDCCA, the pattern CCA is compared three times to the input, as shown in Figure 3. At each step no record of previous partial matches or mismatches is remembered. In this example, C is compared with the second input token D during the first and second steps, and therefore, the second step could have been skipped by remembering the comparisons from the first step. 
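The redundant work in the sliding-template comparison can be made concrete with a small sketch (our code, not the paper's), which counts the symbol comparisons made when matching the pattern C C A against the input C D C C A.

```python
# Naive sliding-window match, instrumented to count symbol comparisons.
# No record of earlier partial matches is kept, mirroring the description above.

def naive_match(pattern, text):
    comparisons = 0
    hits = []
    for i in range(len(text) - len(pattern) + 1):   # three alignment steps here
        for j, p in enumerate(pattern):
            comparisons += 1
            if text[i + j] != p:
                break
        else:
            hits.append(i)
    return hits, comparisons

hits, n = naive_match("CCA", "CDCCA")
print(hits, n)   # [2] 6 -- one match, six comparisons
```

Note that D at position 1 is compared against the pattern twice (steps one and two), which is exactly the redundancy a deterministic machine avoids.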
This method is similar to a naive pattern-matching algorithm.</Paragraph> <Paragraph position="4"> The second reason for inefficiency is the potential interaction between rules. For example, when the rules in Figure 1 are applied to sentence (3), the first rule results in a change (6) that is undone by the second rule, as shown in (9). The algorithm may therefore perform unnecessary computation. [Running head: Emmanuel Roche and Yves Schabes, Deterministic Part-of-Speech Tagging]</Paragraph> <Paragraph position="5"> In summary, Brill's algorithm for implementing the contextual tagger may require RKn elementary steps to tag an input of n words with R contextual rules requiring at most K tokens of context.</Paragraph> <Paragraph position="6"> 4. Construction of the Finite-State Tagger We show how the function represented by each contextual rule can be represented as a nondeterministic finite-state transducer and how the sequential application of the contextual rules also corresponds to a nondeterministic finite-state transducer, obtained by composing the individual transducers. We will then turn the nondeterministic transducer into a deterministic transducer. The resulting part-of-speech tagger operates in linear time, independent of the number of rules and the length of the context. The new tagger operates in optimal time in the sense that the time to assign tags to a sentence corresponds to the time required to follow a single path in the resulting deterministic finite-state machine.</Paragraph> <Paragraph position="7"> Our work relies on two central notions: the notion of a finite-state transducer and the notion of a subsequential transducer. Informally speaking, a finite-state transducer is a finite-state automaton whose transitions are labeled by pairs of symbols. The first symbol is the input and the second is the output.
Applying a finite-state transducer to an input consists of following a path according to the input symbols while storing the output symbols, the result being the sequence of output symbols stored. Section 8.1 formally defines the notion of transducer.</Paragraph> <Paragraph position="8"> Finite-state transducers can be composed, intersected, merged with the union operation and sometimes determinized. Basically, one can manipulate finite-state transducers as easily as finite-state automata. However, whereas every finite-state automaton is equivalent to some deterministic finite-state automaton, there are finite-state transducers that are not equivalent to any deterministic finite-state transducer. Transductions that can be computed by some deterministic finite-state transducer are called subsequential functions. We will see that the final step of the compilation of our tagger consists of transforming a finite-state transducer into an equivalent subsequential transducer.</Paragraph> <Paragraph position="9"> We will use the following notation when pictorially describing a finite-state transducer: final states are depicted with two concentric circles; ε represents the empty string; on a transition from state i to state j, a/b indicates a transition on input symbol a and output symbol(s) b; a question mark (?) on an input transition (for example labeled ?/b) originating at state i stands for any input symbol that does not appear as input symbol on any other outgoing arc from i. In this document, each depicted finite-state transducer will be assumed to have a single initial state, namely the leftmost state (usually labeled 0).</Paragraph> <Paragraph position="10"> We are now ready to construct the tagger. Given a set of rules, the tagger is constructed in four steps.</Paragraph> <Paragraph position="11"> The first step consists of turning each contextual rule found in Brill's tagger into a finite-state transducer.
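Applying a possibly nondeterministic transducer can be sketched as follows. The encoding, the toy tag set, and the transition table are our assumptions, not the paper's data structures; they mimic a transducer for the rule vbn vbd PREVTAG np, where state 1 remembers that the previous tag was np.

```python
# Hedged sketch: apply a (possibly nondeterministic) finite-state transducer
# by keeping every (state, output-so-far) pair alive while reading the input,
# then collecting the outputs of paths ending in a final state.

def run(transducer, word):
    trans, initial, finals = transducer
    paths = {(initial, ())}
    for sym in word:
        paths = {(q2, out + (emit,))
                 for (q, out) in paths
                 for (a, emit, q2) in trans.get(q, ())
                 if a == sym}
    return {out for (q, out) in paths if q in finals}

TAGS = ("np", "vbn", "vbd", "at", "nn")       # assumed tiny tag set
trans = {0: [("np", "np", 1)] + [(t, t, 0) for t in TAGS if t != "np"],
         1: [("np", "np", 1), ("vbn", "vbd", 0)]
            + [(t, t, 0) for t in TAGS if t not in ("np", "vbn")]}
T1 = (trans, 0, {0, 1})

print(run(T1, ("np", "vbn", "at", "nn")))     # {('np', 'vbd', 'at', 'nn')}
```

The set-of-paths loop is exactly what determinization later removes: a subsequential machine carries a single path instead of a set.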
Following the example discussed in Section 2, the functionality of the rule vbn vbd PREVTAG np is represented by the transducer shown on the left of Figure 4. Each contextual rule is defined locally; that is, the transformation it describes must be applied at each position of the input sequence. For instance, the rule A B PREVIOR2TAG C, which changes A into B if the previous tag or the one before is C, must be applied twice on C A A (resulting in the output C B B). As we have seen in the previous section, this method is not efficient.</Paragraph> <Paragraph position="12"> The second step consists of turning the transducers produced by the preceding step into transducers that operate globally on the input in one pass. This transformation is performed for each transducer associated with each rule. Given a function f1 that transforms, say, a into b (i.e. f1(a) = b), we want to extend it to a function f2 such that f2(w) = w′, where w′ is the word built from the word w in which each occurrence of a has been replaced by b. We say that f2 is the local extension 3 of f1, and we write f2 = LocExt(f1). Section 8.2 formally defines this notion and gives an algorithm for computing the local extension.</Paragraph> <Paragraph position="13"> Referring to the example of Section 2, the local extension of the transducer for the rule vbn vbd PREVTAG np is shown to the right of Figure 4. Similarly, the transducer for the contextual rule vbd vbn NEXTTAG by and its local extension are shown in Figure 5. The transducers obtained in the previous step still need to be applied one after the other.</Paragraph> <Paragraph position="14"> Figure 7 Example of a transducer not equivalent to any subsequential transducer.</Paragraph> <Paragraph position="15"> The third step combines all transducers into one single transducer. This corresponds to the formal operation of composition defined on transducers.
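Composition can be illustrated with a product construction. The sketch below is ours and is restricted to the simple case where every transition reads and writes exactly one symbol; the general algorithm cited next also handles empty emissions and unequal lengths.

```python
# Hedged sketch of transducer composition as a product construction.
# t[(state, a)] = (b, next_state) encodes a letter-to-letter transducer.

def compose(t1, t2, start1, start2):
    """A transducer equivalent to: run t1, then run t2 on t1's output."""
    composed, stack, seen = {}, [(start1, start2)], set()
    while stack:
        q1, q2 = stack.pop()
        if (q1, q2) in seen:
            continue
        seen.add((q1, q2))
        for (p1, a), (b, r1) in t1.items():
            if p1 == q1 and (q2, b) in t2:
                c, r2 = t2[(q2, b)]
                composed[((q1, q2), a)] = (c, (r1, r2))
                stack.append((r1, r2))
    return composed

t1 = {(0, "a"): ("b", 0), (0, "x"): ("x", 0)}   # rewrites a -> b
t2 = {(0, "b"): ("c", 0), (0, "x"): ("x", 0)}   # rewrites b -> c
t3 = compose(t1, t2, 0, 0)
print(t3[((0, 0), "a")])   # ('c', (0, 0)): a -> c in one machine
```

States of the composed machine are pairs of states of the two operands, which is why composing all the contextual rules yields one (possibly nondeterministic) transducer.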
The formalization of this notion and an algorithm for computing the composed transducer are well known and were originally described by Elgot and Mezei (1965).</Paragraph> <Paragraph position="16"> Returning to our running example of Section 2, the transducer obtained by composing the local extension of T2 (right in Figure 5) with the local extension of T1 (right in Figure 4) is shown in Figure 6.</Paragraph> <Paragraph position="17"> The fourth and final step consists of transforming the finite-state transducer obtained in the previous step into an equivalent subsequential (deterministic) transducer. The transducer obtained in the previous step may contain some nondeterminism. The fourth step tries to turn it into a deterministic machine. This determinization is not possible for every finite-state transducer. For example, the transducer shown in Figure 7 is not equivalent to any subsequential transducer. Intuitively speaking, this transducer has to look ahead an unbounded distance in order to correctly generate the output. This intuition will be formalized in Section 9.2.</Paragraph> <Paragraph position="18"> However, as proven in Section 10, the rules inferred in Brill's tagger can always be turned into a deterministic machine. Section 9.1 describes an algorithm for determinizing finite-state transducers. This algorithm will not terminate when applied to transducers representing nonsubsequential functions.</Paragraph> <Paragraph position="19"> In our running example, the transducer in Figure 6 has some nondeterministic paths. For example, from state 0 on input symbol vbd, two emissions are possible: vbn (from 0 to 2) and vbd (from 0 to 3). This nondeterminism is due to the rule vbd vbn NEXTTAG by, since this rule has to read the second symbol before it can know which symbol must be emitted. The deterministic version of the transducer T3 is shown in Figure 8.
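The single-pass behavior of the determinized machine can be sketched as follows. The state numbers and fallback transitions are our assumptions, reconstructed from the description of Figure 8: on vbd the machine emits nothing and waits for the next symbol to decide between vbd and vbn by.

```python
# Hedged sketch of a subsequential (deterministic) pass with delayed emission.

DELTA = {                      # (state, symbol) -> (emission, next state)
    (0, "vbd"): ("", 2),       # postpone the output for vbd
    (2, "by"): ("vbn by", 0),  # the rule fired: emit the delayed pair
    (2, "vbd"): ("vbd", 2),    # emit the old vbd, postpone the new one
}

def tag(tags):
    out, state = [], 0
    for sym in tags:
        emit, state = DELTA.get((state, sym),
                                ("vbd " + sym, 0) if state == 2 else (sym, 0))
        out.extend(emit.split())
    if state == 2:             # final emission: flush a still-postponed vbd
        out.append("vbd")
    return out

print(tag(["np", "vbd", "by", "nn"]))   # ['np', 'vbn', 'by', 'nn']
print(tag(["np", "vbd", "at", "nn"]))   # ['np', 'vbd', 'at', 'nn']
```

Each input symbol triggers exactly one table lookup, so the whole sentence is tagged by following a single path, which is the source of the linear-time claim.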
Figure 8 Subsequential form for T3.</Paragraph> <Paragraph position="20"> Whenever nondeterminism arises in T3, the deterministic machine emits the empty symbol ε and postpones the emission of the output symbol. For example, from the start state 0, the empty string is emitted on input vbd, while the current state is set to 2. If the following word is by, the two-token string vbn by is emitted (from 2 to 0); otherwise vbd is emitted (depending on the input, from 2 to 2 or from 2 to 0).</Paragraph> <Paragraph position="21"> Using an appropriate implementation for finite-state transducers (see Section 11), the resulting part-of-speech tagger operates in linear time, independently of the number of rules and the length of the context. The new tagger therefore operates in optimal time.</Paragraph> <Paragraph position="22"> We have shown how the contextual rules can be implemented very efficiently. We now turn our attention to lexical assignment, the step that precedes the application of the contextual transducer. This step can also be made very efficient.</Paragraph> </Section> <Section position="5" start_page="233" end_page="233" type="metho"> <SectionTitle> 5. Lexical Tagger </SectionTitle> <Paragraph position="0"> The first step of the tagging process consists of looking up each word in a dictionary.</Paragraph> <Paragraph position="1"> Since the dictionary is the largest part of the tagger in terms of space, a compact representation is crucial. Moreover, the lookup process has to be very fast as well; otherwise the improvement in speed of the contextual manipulations would be of little practical interest.</Paragraph> <Paragraph position="2"> To achieve high speed for this procedure, the dictionary is represented by a deterministic finite-state automaton with both fast access and small storage space. Suppose one wants to encode the sample dictionary of Figure 9.
The algorithm, as described by Revuz (1991), consists of first building a tree whose branches are labeled by letters and whose leaves are labeled by a list of tags (such as nn vb), and then minimizing it into a directed acyclic graph (DAG). The result of applying this procedure to the sample dictionary of Figure 9 is the DAG of Figure 10. When a dictionary is represented as a DAG, looking up a word in it consists simply of following one path in the DAG.</Paragraph> <Paragraph position="3"> The complexity of the lookup procedure depends only on the length of the word; in particular, it is independent of the size of the dictionary.</Paragraph> <Paragraph position="4"> The lexicon used in our system encodes 54,000 words. The corresponding DAG takes 360KB of space and provides an access time of 12,000 words per second. 4</Paragraph> </Section> <Section position="6" start_page="233" end_page="234" type="metho"> <SectionTitle> 4 The size of the dictionary in plain text (ASCII form) is 742KB. </SectionTitle> <Paragraph position="0"/> </Section> <Section position="7" start_page="234" end_page="234" type="metho"> <SectionTitle> 6. Tagging Unknown Words </SectionTitle> <Paragraph position="0"> The rule-based system described by Brill (1992) contains a module that operates after all known words--that is, words listed in the dictionary--have been tagged with their most frequent tag, and before contextual rules are applied. This module guesses a tag for a word according to its suffix (e.g. a word with an &quot;ing&quot; suffix is likely to be a verb), its prefix (e.g. a word starting with an uppercase character is likely to be a proper noun), and other relevant properties.</Paragraph> <Paragraph position="1"> This module basically follows the same techniques as the ones used to implement the lexicon.
Because of the similarity of the methods used, we do not provide further details about this module.</Paragraph> </Section> <Section position="8" start_page="234" end_page="235" type="metho"> <SectionTitle> 7. Empirical Evaluation </SectionTitle> <Paragraph position="0"> The tagger we constructed has an accuracy identical 5 to Brill's tagger and comparable to that of statistical methods. However, it runs at a much higher speed. The tagger runs nearly ten times faster than the fastest of the other systems. Moreover, the finite-state tagger inherits from the rule-based system its compactness compared with a stochastic tagger. In fact, whereas stochastic taggers have to store word-tag, bigram, and trigram probabilities, the rule-based tagger and therefore the finite-state one only have to encode a small number of rules (between 200 and 300).</Paragraph> <Paragraph position="1"> We empirically compared our tagger with Eric Brill's implementation of his tagger, and with our implementation of a trigram tagger adapted from the work of Church (1988) that we previously implemented for another purpose. We ran the three programs on large files and piped their output into a file. In the times reported, we included the time spent reading the input and writing the output. Figure 11 summarizes the results. All taggers were trained on a portion of the Brown corpus. The experiments were run on an HP720 with 32MB of memory. In order to conduct a fair comparison, the dictionary lookup part of the stochastic tagger has also been implemented using the techniques described in Section 5. All three taggers have approximately the same 5 Our current implementation is functionally equivalent to the tagger as described by Brill (1992).
However, the tagger could be extended to include the improvements described in more recent papers (Brill 1994).</Paragraph> <Paragraph position="2"> Figure 12 Speeds of the different parts of the program.</Paragraph> <Paragraph position="3"> precision (95% of the tags are correct). 6 By design, the finite-state tagger produces the same output as the rule-based tagger. The rule-based tagger--and the finite-state tagger--do not always produce the exact same tagging as the stochastic tagger (they do not make the same errors); however, no significant difference in performance between the systems was detected. 7 Independently, Cutting et al. (1992) quote a performance of 800 words per second for their part-of-speech tagger based on hidden Markov models.</Paragraph> <Paragraph position="4"> The space required by the finite-state tagger (815KB) is distributed as follows: 363KB for the dictionary, 440KB for the subsequential transducer, and 12KB for the module for unknown words.</Paragraph> <Paragraph position="5"> The speeds of the different parts of our system are shown in Figure 12. 8 Our system reaches a performance level in speed for which other, very low-level factors (such as storage access) may dominate the computation. At such speeds, the time spent reading the input file, breaking the file into sentences, breaking the sentences into words, and writing the result into a file is no longer negligible.</Paragraph> </Section> <Section position="9" start_page="235" end_page="241" type="metho"> <SectionTitle> 8. Finite-State Transducers </SectionTitle> <Paragraph position="0"> The methods used in the construction of the finite-state tagger were presented informally in the previous sections. In the following section, the notion of finite-state transducer and the notion of local extension are defined. We also provide an algorithm for computing the local extension of a finite-state transducer.
Issues related to the determinization of finite-state transducers are discussed in the section following this one.</Paragraph> <Section position="1" start_page="235" end_page="237" type="sub_section"> <SectionTitle> 8.1 Definition of Finite-State Transducers </SectionTitle> <Paragraph position="0"> A finite-state transducer T is a five-tuple (Σ, Q, i, F, E) where: Σ is a finite alphabet; Q is a finite set of states or vertices; i ∈ Q is the initial state; F ⊆ Q is the set of final states; and E ⊆ Q × (Σ ∪ {ε}) × Σ* × Q is the finite set of edges.</Paragraph> <Paragraph position="2"> 6 For evaluation purposes, we randomly selected 90% of the Brown corpus for training purposes and 10% for testing.</Paragraph> <Paragraph position="3"> 7 An extended discussion of the precision of the rule-based tagger can be found in Brill (1992). 8 In Figure 12, the dictionary lookup includes reading the file, splitting it into sentences, looking up each word in the dictionary, and writing the final result to a file. The dictionary lookup and the tagging of unknown words take roughly the same amount of time, but since the second procedure applies only to unknown words (around 10% in our experiments), the percentage of time it takes is much smaller.</Paragraph> <Paragraph position="5"> A finite-state transducer T also defines a function on words in the following way: the extended set of edges Ê, the transitive closure of E, is defined by the following recursive relation: (q, ε, ε, q) ∈ Ê for each q ∈ Q, and if (q, w, w′, q′) ∈ Ê and (q′, a, u, q′′) ∈ E then (q, wa, w′u, q′′) ∈ Ê.</Paragraph> <Paragraph position="7"> Then the function f from Σ* to Σ* defined by f(w) = w′ iff ∃q ∈ F such that (i, w, w′, q) ∈ Ê is the function defined by T. One says that T represents f and writes f = |T|.</Paragraph> <Paragraph position="8"> The functions on words that are represented by finite-state transducers are called rational functions. If, for some input w, more than one output is allowed (e.g. f(w) = {w1, w2, ...
}) then f is called a rational transduction.</Paragraph> <Paragraph position="9"> In the example of Figure 13, |T4| is defined by |T4|(ah) = bh and |T4|(ae) = ce.</Paragraph> <Paragraph position="10"> Given a finite-state transducer T = (Σ, Q, i, F, E), the following additional notions are useful: its state transition function d that maps Q × (Σ ∪ {ε}) into 2^Q, defined by d(q, a) = {q′ ∈ Q | ∃w′ ∈ Σ* and (q, a, w′, q′) ∈ E}; and its emission function δ that maps Q × (Σ ∪ {ε}) × Q into 2^{Σ*}, defined by δ(q, a, q′) = {w′ ∈ Σ* | (q, a, w′, q′) ∈ E}.</Paragraph> <Paragraph position="11"> A finite-state transducer can be seen as a finite-state automaton whose transition labels are pairs. In this respect, T4 would be deterministic; however, since transducers are generally used to compute a function, a more relevant definition of determinism consists of saying that both the transition function d and the emission function δ lead to sets containing at most one element, that is, |d(q, a)| ≤ 1 and |δ(q, a, q′)| ≤ 1 (and that these sets are empty for a = ε). With this notion, if a finite-state transducer is deterministic, one can apply the function to a given word by deterministically following a single path in the transducer. Deterministic transducers are called subsequential transducers (Schützenberger 1977). 9 Given a deterministic transducer, we can define the partial functions q ⊗ a = q′ iff d(q, a) = {q′}, and q ∗ a = w′ iff ∃q′ ∈ Q such that q ⊗ a = q′ and δ(q, a, q′) = {w′}.
This leads to the definition of subsequential transducers: a subsequential transducer T′ is a seven-tuple (Σ, Q, i, F, ⊗, ∗, ρ) where: Σ, Q, i, F are defined as above; ⊗ is the deterministic state transition function that maps Q × Σ to Q, one writes q ⊗ a = q′; ∗ is the deterministic emission function that maps Q × Σ to Σ*, one writes q ∗ a = w′; and the final emission function ρ maps F to Σ*, one writes ρ(q) = w.</Paragraph> <Paragraph position="12"> For instance, T4 is not deterministic because d(0, a) = {1, 2}, but it is equivalent to T5, represented in Figure 14, in the sense that they represent the same function, i.e.</Paragraph> <Paragraph position="13"> 9 A sequential transducer is a deterministic transducer for which all states are final. Sequential transducers are also called generalized sequential machines (Eilenberg 1974).</Paragraph> <Paragraph position="14"> Top: Input. Middle: First factorization. Bottom: Second factorization.</Paragraph> <Paragraph position="15"> |T4| = |T5|. T5 is defined by T5 = ({a, b, c, h, e}, {0, 1, 2}, 0, {2}, ⊗, ∗, ρ) where 0 ⊗ a = 1, 0 ∗ a = ε, 1 ⊗ h = 2, 1 ∗ h = bh, 1 ⊗ e = 2, 1 ∗ e = ce, and ρ(2) = ε.</Paragraph> </Section> <Section position="2" start_page="237" end_page="241" type="sub_section"> <SectionTitle> 8.2 Local Extension </SectionTitle> <Paragraph position="0"> In this section, we will see how a function that needs to be applied at all input positions can be transformed into a global function that needs to be applied only once to the whole input.</Paragraph> <Paragraph position="1"> For instance, consider T6 of Figure 15. It represents the function f6 = |T6| such that f6(ab) = bc and f6(bca) = dca. We want to build the function that, given a word w, each time w contains ab (i.e. ab is a factor of the word) (resp. bca), transforms this factor into its image bc (resp. dca).
Suppose, for instance, that the input word is w = aabcab, as shown in Figure 16, and that the factors that are in dom(f6) 10 can be found according to two different factorizations: i.e. w = a · w2 · c · w2, where w2 = ab, and w = aa · w3 · b, where w3 = bca. The local extension of f6 will be the transduction that takes each possible factorization and transforms each factor according to f6, i.e. f6(w2) = bc and f6(w3) = dca, and leaves the other parts unchanged; here this leads to two outputs: abccbc according to the first factorization, and aadcab according to the second factorization.</Paragraph> <Paragraph position="2"> The notion of local extension is formalized through the following definition.</Paragraph> <Paragraph position="3"> Local extension algorithm.</Paragraph> <Paragraph position="4"> Intuitively, if F = LocExt(f) and w ∈ Σ*, each factor of w in dom(f) is transformed into its image by f and the remaining part of w is left unchanged. If f is represented by a finite-state transducer T and LocExt(f) is represented by a finite-state transducer T′, one writes T′ = LocExt(T).</Paragraph> <Paragraph position="5"> It can also be seen that if γT is the identity function on Σ* − (Σ* · dom(T) · Σ*), then LocExt(T) = γT · (T · γT)*. 12 Figure 17 gives an algorithm that computes the local extension directly.</Paragraph> <Paragraph position="6"> The idea is that an input word is processed nondeterministically from left to right.
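The two-factorization example above can be checked with a short sketch of ours (a direct enumeration, not the transducer construction of Figure 17): untouched stretches may not themselves contain a factor of dom(f6), so exactly the two factorizations survive.

```python
# Hedged sketch: enumerate the outputs of the local extension of f6 on
# w = aabcab by choosing, at each position, either to copy the letter or to
# rewrite a factor of dom(f6), rejecting runs of copied letters that still
# contain a factor (replacement is obligatory).

F6 = {"ab": "bc", "bca": "dca"}

def local_outputs(word, f):
    res = set()
    def has_factor(s):
        return any(x in s for x in f)
    def rec(i, run, out):                 # run = current untouched stretch
        if i == len(word):
            if not has_factor(run):
                res.add(out + run)
            return
        rec(i + 1, run + word[i], out)    # copy this letter unchanged
        if not has_factor(run):
            for factor, image in f.items():
                if word.startswith(factor, i):
                    rec(i + len(factor), "", out + run + image)
    rec(0, "", "")
    return res

print(sorted(local_outputs("aabcab", F6)))   # ['aadcab', 'abccbc']
```

The two outputs match the two factorizations in the text: a · ab · c · ab yields abccbc, and aa · bca · b yields aadcab.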
Suppose, for instance, that we have the initial transducer T7 of Figure 18 and that we want to build its local extension, T8 of Figure 19.</Paragraph> <Paragraph position="7"> When the input is read, if a current input letter cannot be transformed at the initial state of T7 (the letter c for instance), it is left unchanged: this is expressed by the looping transition on the initial state 0 of T8 labeled ?/?.13 12 In this last formula, the concatenation · stands for the concatenation of the graphs of each function; that is, for the concatenation of the transducers viewed as automata whose labels are of the form a/b. 13 As explained before, an input transition labeled by the symbol ? stands for all transitions labeled with a letter that doesn't appear as input on any outgoing arc from this state. A transition labeled ?/? stands for all the diagonal pairs a/a s.t. a is not an input symbol on any outgoing arc from this state. On the other hand, if the input symbol, say a, can be processed at the initial state of T7, one doesn't know yet whether a will be the beginning of a word that can be transformed (e.g. ab) or whether it will be followed by a sequence that makes it impossible to apply the transformation (e.g. ac). Hence one has to entertain two possibilities, namely (1) we are processing the input according to T7 and the transition should be a/b; or (2) we are within the identity and the transition should be a/a. This leads to two kinds of states: the transduction states (marked transduction in the algorithm) and the identity states (marked identity in the algorithm). It can be seen in Figure 19 that this leads to a transducer that has a copy of the initial transducer and an additional part that processes the identity while making sure it could not have been transformed.
In other words, the algorithm consists of building a copy of the original transducer and at the same time the identity function that operates on Σ* − Σ* · dom(T) · Σ*.</Paragraph> <Paragraph position="8"> Let us now see how the algorithm of Figure 17 applies step by step to the transducer T7 of Figure 18, producing the transducer T8 of Figure 19.</Paragraph> <Paragraph position="9"> In Figure 17, C'[0] = ({i}, identity) of line 1 states that state 0 of the transducer to be built is of type identity and refers to the initial state i = 0 of T7. q represents the current state and n the current number of states. In the loop do{...} while (q < n), one builds the transitions of each state one after the other: if the transition points to a state not already built, a new state is added, thus incrementing n. The program stops when all states have been inspected and when no additional state is created. The number of iterations is bounded by 2 × 2^||T||, where ||T|| = |Q| is the number of states of the original transducer. 14 Line 3 says that the current state within the loop is q and that this state refers to the set of states S and is marked by the type type. In our example, at the first occurrence of this line, S is instantiated to {0} and type = identity. Line 5 adds the current identity state to the set of final states and a transition to the initial state for all letters that do not appear on any outgoing arc from this state. Lines 6-11 build the transitions from and to the identity states, keeping track of where this leads in the original transducer. For instance, a is a label that verifies the conditions of line 6. Thus a transition a/a is to be added to the identity state 2, which refers to 1 (because of the transition a/b of T7) and to i = 0 (because it is possible to start the transduction T7 from any identity state).
Line 7 checks that this state doesn't already exist and adds it if necessary. e = n++ means that the arrival state for this transition, i.e. d(q, w), will be the last added state and that the number of states being built has to be incremented.</Paragraph> <Paragraph position="10"> Line 11 actually builds the transition between 0 and e = 2 labeled a/a. Lines 12-17 describe the fact that it is possible to start a transduction from any identity state. Here a transition is added to a new state, i.e. a/b to 3. The next state to be considered is 2 and it is built like state 0, except that the symbol b should block the current output. In fact, state 1 means that we already read a with a as output; thus, if one reads b, ab is at the current point, and since ab should be transformed into bc, the current identity transformation (that is, a to a) should be blocked: this is expressed by the transition b/b that leads to state 1 (this state is a &quot;trash&quot; state; that is, it has no outgoing transition and it is not final).</Paragraph> <Paragraph position="11"> The following state is 3, which is marked as being of type transduction, which means that lines 19-27 should be applied. This consists simply of copying the transitions of the original transducer. If the original state was final, as for 4 = ({2}, transduction), an ε/ε transition to the initial state is added (to get the behavior of T+).</Paragraph> <Paragraph position="12"> The transducer T9 = LocExt(T6) of Figure 20 gives a more complete (and slightly more complex) example of this algorithm.</Paragraph> </Section> </Section> <Section position="10" start_page="241" end_page="249" type="metho"> <SectionTitle> 9. Determinization </SectionTitle> <Paragraph position="0"> The basic idea behind the determinization algorithm comes from Mehryar Mohri.
15 In this section, after giving a formalization of the algorithm, we introduce a proof of soundness and completeness, and we study its worst-case complexity.</Paragraph> <Section position="1" start_page="241" end_page="242" type="sub_section"> <SectionTitle> 9.1 Determinization Algorithm </SectionTitle> <Paragraph position="0"> In the following, for w1, w2 ∈ Σ*, w1 ∧ w2 denotes the longest common prefix of w1 and w2.</Paragraph> <Paragraph position="1"> The finite-state transducers we use in our system have the property that they can be made deterministic; that is, there exists a subsequential transducer that represents the same function. 16 If T = (Σ, Q, i, F, E) is such a finite-state transducer, the subsequential transducer T' = (Σ, Q', i', F', ⊗, *, ρ) defined as follows will later be proved equivalent to T: Q' ⊆ 2^(Q×Σ*). In fact, the determinization of the transducer is related to the determinization of FSAs in the sense that it also involves a power set construction. The difference is that one has to keep track of the set of states of the original transducer one might be in, and also of the words whose emission has been postponed. For instance, a state {(q1,w1), (q2,w2)} means that this state corresponds to a path that leads to q1 and q2 in the original transducer and that the emission of w1 (resp.</Paragraph> <Paragraph position="2"> w2) was delayed for q1 (resp. q2).</Paragraph> <Paragraph position="3"> i' = {(i, ε)}. There is no postponed emission at the initial state.</Paragraph> <Paragraph position="4"> The emission function is defined by: S * a = ∧ {u · δ(q,a,q') | (q,u) ∈ S, q' ∈ d(q,a)}.</Paragraph> <Paragraph position="6"> This means that, for a given symbol, the set of possible emissions is obtained by concatenating the postponed emissions with the emission at the current state.
Since one wants the transition to be deterministic, the actual emission is the longest common prefix of this set.</Paragraph> <Paragraph position="7"> The state transition function is defined by: S ⊗ a = {(q', (S * a)^{-1} · u · δ(q,a,q')) | (q,u) ∈ S, q' ∈ d(q,a)}.</Paragraph> <Paragraph position="9"> The final emission function is defined by ρ(S) = u if (q,u) ∈ S and q ∈ F; it will be shown in the proof of correctness that ρ is properly defined.</Paragraph> <Paragraph position="10"> 15 Mohri (1994b) also gives a formalization of the algorithm. 16 As opposed to automata, a large class of finite-state transducers do not have any deterministic representation; they cannot be determinized.</Paragraph> <Paragraph position="11"> The determinization algorithm of Figure 21 computes the above subsequential transducer.</Paragraph> <Paragraph position="12"> Let us now apply the determinization algorithm of Figure 21 to the finite-state transducer T4 of Figure 13 and show how it builds the subsequential transducer T10 of Figure 22. Line 1 of the algorithm builds the first state and instantiates it with the pair {(0, ε)}. q and n respectively denote the current state and the number of states having been built so far. At line 5, one takes all the possible input symbols; here only a is possible. w' of line 6 is the output, the longest common prefix of the possible emissions; here w' = ε, and</Paragraph> <Paragraph position="14"> on line 9, a new state e is created to which the transition labeled a/w' = a/ε points and n is incremented. On line 15, the program goes to the construction of the transitions of state 1. On line 5, h and e are then two possible symbols. The first symbol, h, at line 6, is such that w' = bh.</Paragraph> <Paragraph position="16"> State 2, labeled {(2, ε)}, is thus added, and a transition labeled h/bh that points to state 2 is also added. The transition for the input symbol e is computed the same way.</Paragraph> <Paragraph position="17"> The subsequential transducer generated by this algorithm could in turn be minimized by an algorithm described in Mohri (1994a).
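The construction just walked through can be sketched in code as follows. The encoding of T4 is our own guess at Figure 13 (which is not reproduced here), with the outputs b and c delayed on the two nondeterministic a-transitions, and `os.path.commonprefix` playing the role of the longest common prefix ∧:

```python
from os.path import commonprefix  # character-wise longest common prefix

# Subset construction for transducers: each deterministic state is a set of
# (original-state, delayed-output) pairs; each transition emits the longest
# common prefix of all possible emissions and delays the rest.
def determinize(trans, initial, finals):
    """trans[q][a] = set of (q', output) pairs (nondeterministic)."""
    start = frozenset({(initial, "")})
    states, agenda = {start: 0}, [start]
    delta, emit, rho = {}, {}, {}
    while agenda:
        S = agenda.pop()
        q = states[S]
        for (p, u) in S:
            if p in finals:
                rho[q] = u  # final emission; well defined when |T| is a function
        symbols = {a for (p, _) in S for a in trans.get(p, {})}
        for a in sorted(symbols):
            pairs = [(p2, u + w) for (p, u) in S
                     for (p2, w) in trans.get(p, {}).get(a, ())]
            out = commonprefix([u2 for (_, u2) in pairs])  # S * a
            S2 = frozenset((p2, u2[len(out):]) for (p2, u2) in pairs)
            if S2 not in states:
                states[S2] = len(states)
                agenda.append(S2)
            delta[(q, a)] = states[S2]
            emit[(q, a)] = out
    return delta, emit, rho

# Hypothetical encoding of T4: d(0, a) = {1, 2} with outputs b and c,
# then h/h from 1 and e/e from 2 into the final state 3.
T4 = {0: {"a": {(1, "b"), (2, "c")}},
      1: {"h": {(3, "h")}},
      2: {"e": {(3, "e")}}}
```

Running it on this T4 yields the transitions a/ε, h/bh, and e/ce of T10 (equivalently T5).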
However, in our case, the transducer is nearly minimal.</Paragraph> </Section> <Section position="2" start_page="242" end_page="247" type="sub_section"> <SectionTitle> 9.2 Proof of Correctness </SectionTitle> <Paragraph position="0"> Although it is decidable whether a function is subsequential or not (Choffrut 1977), the determinization algorithm described in the previous section does not terminate when run on a nonsubsequential function.</Paragraph> <Paragraph position="1"> Two issues are addressed in this section. First, the proof of soundness: the fact that if the algorithm terminates, then the output transducer is deterministic and represents the same function. Second, the proof of completeness: the algorithm terminates in the case of subsequential functions.</Paragraph> <Paragraph position="2"> Soundness and completeness are a consequence of the main proposition, which states that if a transducer T represents a subsequential function f, then the algorithm DeterminizeTransducer described in the previous section applied on T computes a subsequential transducer representing the same function.</Paragraph> <Paragraph position="3"> In order to simplify the proofs, we will only consider transducers that do not have e input transitions, that is E C Q x ~ x ~* x Q, and also without loss of generality,</Paragraph> <Paragraph position="5"> if 3(~, u) C/ S s.t. ~ C/ F then F' = F' U {q} and p(q) = u; foreach w such that 3(~,u) E S and d(~,w) 0 { Subsequential transducer T10 such that IT10I = IT4I . transducers that are reduced and that are deterministic in the sense of finite-state automata. 17 In order to prove this proposition, we need to establish some preliminary notations and lemmas.</Paragraph> <Paragraph position="6"> First we extend the definition of the transition function d, the emission function 6, the deterministic transition function @, and the deterministic emission function * on words in the classical way. 
We then have the following properties: q * ab = (q * a) · ((q ⊗ a) * b).</Paragraph> <Paragraph position="8"> 17 A transducer defines an automaton whose labels are the pairs &quot;input/output&quot;; this automaton is assumed to be deterministic.</Paragraph> <Paragraph position="9"> For the following, it is useful to note that if |T| is a function, then δ is a function too.</Paragraph> <Paragraph position="10"> The following lemma states an invariant that holds for each state S built within the algorithm. The lemma will later be used for the proof of soundness. Lemma 1 Let I = C'[0] be the initial state. At each iteration of the &quot;do&quot; loop in DeterminizeTransducer, for each S = C'[q] and for each w ∈ Σ* such that I ⊗ w = S, the following holds: (i) q ∈ d(i,w) iff there exists u s.t. (q,u) ∈ S; (ii) (q,u) ∈ S iff q ∈ d(i,w) and u = (I * w)^{-1} · δ(i,w,q). Proof The proof is by induction on |w|. For w = ε, S = I = {(i, ε)} and both properties hold. For the induction step, line 8 of the algorithm adds to S ⊗ a exactly the pairs (q',u') with q' ∈ d(q,a) for some (q,u) ∈ S, so q' ∈ d(i,wa) iff there exists u' s.t. (q',u') ∈ S ⊗ a. This proves (i).</Paragraph> <Paragraph position="11"> We now turn to (ii). Assuming that (i) and (ii) hold for S and w, then for each a ∈ Σ, let S1 = S ⊗ a; the algorithm (line 8) is such that S1 = {(q', (S * a)^{-1} · u · δ(q,a,q')) | (q,u) ∈ S, q' ∈ d(q,a)}; let S2 = {(q', (I * wa)^{-1} · δ(i,wa,q')) | q' ∈ d(i,wa)}.</Paragraph> <Paragraph position="13"> We show that S1 ⊆ S2. Let (q',u') ∈ S1, then ∃(q,u) ∈ S s.t. q' ∈ d(q,a) and</Paragraph> <Paragraph position="15"> u' = (S * a)^{-1} · u · δ(q,a,q') = (S * a)^{-1} · (I * w)^{-1} · δ(i,w,q) · δ(q,a,q'); that is, since I * wa = (I * w) · (S * a), u' = (I * wa)^{-1} · δ(i, wa, q'). Thus (q',u') ∈ S2. Hence S1 ⊆ S2.</Paragraph> <Paragraph position="16"> We now show that S2 ⊆ S1. Let (q',u') ∈ S2, and let q ∈ d(i,w) be s.t. q' ∈ d(q,a) and u = (I * w)^{-1} · δ(i,w,q); then (q,u) ∈ S and since u' = (I * wa)^{-1} · δ(i, wa, q') = (S * a)^{-1} · u · δ(q,a,q'), (q',u') ∈ S1. This concludes the proof of (ii). □ The following lemma states a common property of the state S, which will be used in the complexity analysis of the algorithm.</Paragraph> <Paragraph position="17"> Lemma 2 If (q,w1), (q,w2) ∈ S, then w1 = w2. Proof According to (ii) of Lemma 1, w1 = (I * w)^{-1} · δ(i,w,q) and w2 = (I * w)^{-1} · δ(i,w,q). Thus w1 = w2. □ The following lemma will also be used for soundness.
It states that the final state emission function is indeed a function.</Paragraph> <Paragraph position="18"> Lemma 3 For each S built in the algorithm, if (q,u), (q',u') ∈ S, then q, q' ∈ F ⇒ u = u'. Proof Let S be one state set built in line 8 of the algorithm. Suppose (q,u), (q',u') ∈ S and q, q' ∈ F. According to (ii) of Lemma 1, u = (I * w)^{-1} · δ(i,w,q) and u' = (I * w)^{-1} · δ(i,w,q'). Since |T| is a function and {δ(i,w,q), δ(i,w,q')} ⊆ |T|(w), then δ(i,w,q) = δ(i,w,q'), therefore u = u'. □ The following lemma will be used for completeness.</Paragraph> <Paragraph position="19"> Lemma 4 Given a transducer T representing a subsequential function, there exists a bound M s.t. for each S built at line 8, for each (q,u) ∈ S, |u| ≤ M. We rely on the following theorem proven by Choffrut (1978): Theorem 1 A function f on Σ* is subsequential iff it has bounded variations and for any rational language L ⊆ Σ*, f^{-1}(L) is also rational.</Paragraph> <Paragraph position="20"> with the following two definitions: Definition The left distance between two strings u and v is ||u,v|| = |u| + |v| − 2|u ∧ v|. Definition A function f on Σ* has bounded variations iff for all k ≥ 0, there exists K > 0 s.t. for all u,v ∈ dom(f), ||u,v|| ≤ k ⇒ ||f(u),f(v)|| ≤ K.</Paragraph> <Paragraph position="21"> Proof of Lemma 4 Let f = |T|. For each q ∈ Q, let c(q) be a string w s.t. d(q,w) ∩ F ≠ ∅ and s.t. |w| is minimal among such strings. Note that |c(q)| ≤ ||T|| where ||T|| is the number of states in T. For each q ∈ Q let s(q) ∈ Q be a state s.t. s(q) ∈ d(q,c(q)) ∩ F. Let us further define M1 = max_{q ∈ Q} |δ(q, c(q), s(q))| and M2 = max_{q ∈ Q} |c(q)|.</Paragraph> <Paragraph position="23"> Since f is subsequential, it is of bounded variations, therefore there exists K s.t. if ||u,v|| ≤ 2M2 then ||f(u),f(v)|| ≤ K. Let M = K + 2M1.</Paragraph> <Paragraph position="24"> Let S be a state set built at line 8, let w be s.t. I ⊗ w = S and A = I * w. Let (q1,u) ∈ S.
Let (q2,v) ∈ S be s.t. u ∧ v = ε. Such a pair always exists, since otherwise the pending emissions in S would share a nonempty common prefix, which the algorithm would already have emitted.</Paragraph> <Paragraph position="25"> Moreover, for any a,b,c,d ∈ Σ*, ||a,c|| ≤ ||ab,cd|| + |b| + |d|. In fact, ||ab,cd|| = |ab| + |cd| − 2|ab ∧ cd| = |a| + |c| + |b| + |d| − 2|ab ∧ cd| = ||a,c|| + 2|a ∧ c| + |b| + |d| − 2|ab ∧ cd|; but |ab ∧ cd| ≤ |a ∧ c| + |b| + |d|, and since ||ab,cd|| = ||a,c|| − 2(|ab ∧ cd| − |a ∧ c| − |b| − |d|) − |b| − |d|, one has ||a,c|| ≤ ||ab,cd|| + |b| + |d|.</Paragraph> <Paragraph position="26"> Therefore, in particular, with ω = δ(q1,c(q1),s(q1)) and ω' = δ(q2,c(q2),s(q2)), |u| ≤ ||Au,Av|| ≤ ||Auω,Avω'|| + |ω| + |ω'|, thus |u| ≤ ||f(w · c(q1)), f(w · c(q2))|| + 2M1. But ||w · c(q1), w · c(q2)|| ≤ |c(q1)| + |c(q2)| ≤ 2M2, thus ||f(w · c(q1)), f(w · c(q2))|| ≤ K and therefore |u| ≤ K + 2M1 = M. □ The time is now ripe for the main proposition, which proves soundness and completeness. Proposition If a transducer T represents a subsequential function f, then the algorithm DeterminizeTransducer described in the previous section applied on T computes a subsequential transducer τ representing the same function.</Paragraph> <Paragraph position="27"> Proof Lemma 4 shows that the algorithm always terminates if |T| is subsequential. Let us show that dom(|τ|) ⊆ dom(|T|). Let w ∈ Σ* s.t. w is not in dom(|T|), then d(i,w) ∩ F = ∅. Thus, according to (ii) of Lemma 1, for all (q,u) ∈ I ⊗ w, q is not in F, thus I ⊗ w is not terminal and therefore w is not in dom(|τ|).</Paragraph> <Paragraph position="28"> Conversely, let w ∈ dom(|T|). There exists a qf ∈ F s.t. |T|(w) = δ(i,w,qf) and s.t. qf ∈ d(i,w). Therefore |τ|(w) = (I * w) ·
((I * w)^{-1} · δ(i,w,qf)), and according to (ii) of Lemma 1, (qf, (I * w)^{-1} · δ(i,w,qf)) ∈ I ⊗ w; since qf ∈ F, Lemma 3 shows that</Paragraph> <Paragraph position="30"> ρ(I ⊗ w) = (I * w)^{-1} · δ(i,w,qf), hence |τ|(w) = δ(i,w,qf) = |T|(w). □</Paragraph> </Section> <Section position="3" start_page="247" end_page="249" type="sub_section"> <SectionTitle> 9.3 Worst-Case Complexity </SectionTitle> <Paragraph position="0"> In this section we give a worst-case upper bound of the size of the subsequential transducer in terms of the size of the input transducer.</Paragraph> <Paragraph position="1"> Let L = {w ∈ Σ* s.t. |w| ≤ M}, where M is the bound defined in the proof of Lemma 4. Since, according to Lemma 2, for each state set Q', for each q ∈ Q, Q' contains at most one pair (q,w), the maximal number N of states built in the algorithm is smaller than the sum, over the state sets Q' ∈ 2^Q, of the number of functions from states to strings in L, that is N ≤ Σ_{Q' ∈ 2^Q} |L|^{|Q'|}; we thus have N ≤ 2^|Q| × |L|^|Q| = 2^|Q| × 2^(|Q| × log2 |L|) and therefore N ≤ 2^(|Q| × (1 + log2 |L|)). Moreover, |L| = 1 + |Σ| + ... + |Σ|^M = (|Σ|^(M+1) − 1) / (|Σ| − 1) if |Σ| > 1, and |L| = M + 1 if |Σ| = 1. In this last formula, M = K + 2M1, as described in Lemma 4. Note that if P = max |δ(q,a,q')| over the transitions of T is the maximal length of the simple transition emissions, M1 ≤ |Q| × P, thus M ≤ K + 2 × |Q| × P.</Paragraph> <Paragraph position="2"> Therefore, if |Σ| > 1, the number of states N is bounded by N ≤ 2^(|Q| × (1 + log2((|Σ|^(K + 2×|Q|×P + 1) − 1) / (|Σ| − 1)))), and if |Σ| = 1, N ≤ 2^(|Q| × (1 + log2(K + 2×|Q|×P + 1))). 10. Subsequentiality of Transformation-Based Systems The proof of correctness of the determinization algorithm and the fact that the algorithm terminates on the transducer encoding Brill's tagger show that the final function is subsequential and equivalent to Brill's original tagger.</Paragraph> <Paragraph position="3"> In this section, we prove in general that any transformation-based system, such as those used by Brill, is a subsequential function.
In other words, any transformation-based system can be turned into a deterministic finite-state transducer. We define transformation-based systems as follows.</Paragraph> <Paragraph position="4"> Definition A transformation-based system is a finite sequence (f1, ..., fn) of subsequential functions whose domains are bounded.</Paragraph> <Paragraph position="5"> Applying a transformation-based system consists of applying each function fi one after the other. Applying one function consists of looking for the first position in the input at which the function can be triggered. When the function is triggered, the longest possible string starting at that position is transformed according to this function. After the string is transformed, the process is iterated starting at the end of the previously transformed string. Then, the next function is applied. The program ends when all functions have been applied.</Paragraph> <Paragraph position="6"> It is not true that, in general, the local extension of a subsequential function is subsequential. 18 For instance, consider the function fa of Figure 23. 18 However, the local extensions of the functions we had to compute were subsequential. Figure 23: Function fa.</Paragraph> <Paragraph position="7"> Figure 23 shows fa with transitions a:b a:b a:b. The local extension of the function fa is not a function. In fact, consider the input string daaaad; it can be decomposed either into d · aaa · ad or into da · aaa · d. The first decomposition leads to the output dbbbad, and the second one to the output dabbbd. The intended use of the rules in the tagger defined by Brill is to apply each function from left to right. In addition, if several decompositions are possible, the one that occurs first is the one chosen.
In our previous example, it means that only the output dbbbad is generated.</Paragraph> <Paragraph position="8"> This notion is now defined precisely.</Paragraph> <Paragraph position="9"> Let α be the rational function defined by α(a) = a for a ∈ Σ and α([) = α(]) = ε on the additional symbols '[' and ']', with α such that α(u · v) = α(u) · α(v). Definition Let Y ⊆ Σ+ and X = Σ* − Σ* · Y · Σ*; a Y-decomposition of x is a string y ∈ X · ([ · Y · ] · X)* s.t. α(y) = x. For instance, if Y = dom(fa) = {aaa}, the set of Y-decompositions of x = daaaad is {d[aaa]ad, da[aaa]d}.</Paragraph> <Paragraph position="10"> Definition Let < be a total order on Σ, and let Σ̄ = Σ ∪ {[,]} be the alphabet Σ with the two additional symbols '[' and ']'. Let us extend the order < to Σ̄ by: ∀a ∈ Σ, '[' < a and a < ']'. < defines a lexicographic order on Σ̄* that we also denote <. Let Y ⊆ Σ+ and x ∈ Σ*; the minimal Y-decomposition of x is the Y-decomposition which is minimal in (Σ̄*, <).</Paragraph> <Paragraph position="11"> For instance, the minimal dom(fa)-decomposition of daaaad is d[aaa]ad. In fact, d[aaa]ad < da[aaa]d.</Paragraph> <Paragraph position="12"> Let dec be defined by dec(w) = u · [ · v · ] · dec((uv)^{-1} · w), where u, v ∈ Σ* are s.t. v ∈ Y, ∃v' ∈ Σ* with w = uvv' and |u| is minimal among such strings, and dec(w) = w if no such u, v exist. The function mdY is total because the function dec always returns an output that is a Y-decomposition of w.</Paragraph> <Paragraph position="13"> We shall now prove that the function is rational and then that it has bounded variations; this will prove, according to Theorem 1, that the function is subsequential. In the following, X = Σ* − Σ* · Y · Σ*. The transduction TY that generates the set of Y-decompositions is defined by TY = IdX · (ε/[ · IdY · ε/] · IdX)* where IdX (resp. IdY) stands for the identity function on X (resp. Y).
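A small sketch of the function dec just defined, under the assumption (not spelled out in the text) that when a factor of Y starts at the leftmost matching position, the longest such factor is taken:

```python
# dec(w): bracket the factor of Y whose start position |u| is minimal,
# then recurse on the remainder; return w unchanged if no factor occurs.
def dec(w, Y):
    best = None
    for i in range(len(w)):
        vs = [v for v in Y if w.startswith(v, i)]
        if vs:
            # assumption: prefer the longest factor at the leftmost position
            best = (i, max(vs, key=len))
            break
    if best is None:
        return w
    i, v = best
    return w[:i] + "[" + v + "]" + dec(w[i + len(v):], Y)
```

With Y = {aaa}, dec("daaaad") returns d[aaa]ad, the minimal decomposition given above.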
Furthermore, the transduction TΣ̄,> that to each string w ∈ Σ̄* associates the set of strings strictly greater than w, that is TΣ̄,>(w) = {w' ∈ Σ̄* | w < w'}, is defined by the transducer of Figure 24, in which A = {(x,x) | x ∈ Σ̄}, B = {(x,y) ∈ Σ̄2 | x < y}, C = Σ̄2, D = {ε} × Σ̄ and E = Σ̄ × {ε}. 19 Therefore, the right-minimal Y-decomposition function mdY is defined by mdY = TY − (TΣ̄,> ∘ TY), which proves that mdY is rational.</Paragraph> <Paragraph position="14"> Let k ≥ 0. Let K = 6 × k + 6 × M, where M = max_{x ∈ Y} |x|. Let u, v ∈ Σ* be s.t. ||u,v|| ≤ k. Let us consider two cases: (i) |u ∧ v| ≤ M and (ii) |u ∧ v| > M. (i): |u ∧ v| ≤ M, thus |u|, |v| ≤ |u ∧ v| + ||u,v|| ≤ M + k. Moreover, for each w ∈ Σ*, for each Y-decomposition w' of w, |w'| ≤ 3 × |w|. In fact, Y doesn't contain ε, thus the number of [ (resp. ]) in w' is smaller than |w|. Therefore, |mdY(u)|, |mdY(v)| ≤ 3 × (M + k), thus ||mdY(u), mdY(v)|| ≤ K.</Paragraph> <Paragraph position="15"> (ii): u ∧ v = λω with |ω| = M. Let μ, ν be s.t. u = λωμ and v = λων. Let λ', ω', μ', λ'', ω'' and ν'' be s.t. mdY(u) = λ'ω'μ', mdY(v) = λ''ω''ν'', α(λ') = α(λ'') = λ, α(ω') = α(ω'') = ω, α(μ') = μ and α(ν'') = ν. Suppose that λ' ≠ λ'', for instance λ' < λ''. Let i be the first index s.t. (λ')i < (λ'')i. 20 We have two possible situations: (ii.1) (λ')i = [ and (λ'')i ∈ Σ or (λ'')i = ]. In that case, since the length of the elements of Y is smaller than M = |ω|, one has λ'ω' = λ1[λ2]λ3 with |λ1| = i, λ2 ∈ Y and λ3 ∈ Σ̄*. We also have λ''ω'' = λ1λ'2λ'3 with α(λ2) = α(λ'2) and the first letter of λ'2 different from [. Let λ4 be a Y-decomposition of α(λ3)ν, then λ1[λ2]λ4 is a Y-decomposition of v strictly smaller than λ''ω''ν'' = mdY(v), which contradicts the minimality of mdY(v). The second situation is (ii.2): (λ')i ∈ Σ and (λ'')i = ], then we have λ'ω' = λ1[λ2λ3]λ4 s.t.
|λ1[λ2| = i and λ''ω'' = λ1[λ2]λ'3λ'4 s.t. α(λ'3) = α(λ3) and α(λ'4) = α(λ4). Let λ5 be a Y-decomposition of α(λ4)ν, then λ1[λ2λ3]λ5 is a Y-decomposition of v strictly smaller than λ''ω''ν'', which leads to the same contradiction. Therefore, λ' = λ'' and since |μ'| + |ν''| ≤ 3 × (|μ| + |ν|) = 3 × ||u,v|| ≤ 3 × k, ||mdY(u), mdY(v)|| ≤ |ω'| + |ω''| + |μ'| + |ν''| ≤ 2 × M + 3 × k ≤ K. This proves that mdY has bounded variations and therefore that it is subsequential. □ We can now define precisely what is the effect of a function when one applies it from left to right, as was done in the original tagger.</Paragraph> <Paragraph position="17"> 19 This construction is similar to the transduction built within the proof of Eilenberg's cross-section theorem. Definition The right-minimal local extension of f, denoted RmLocExt(f), is the composition of a right-minimal Y-decomposition mdY with IdX · ([/ε · f · ]/ε · IdX)*.</Paragraph> <Paragraph position="18"> RmLocExt being the composition of two subsequential functions, it is itself subsequential; this proves the following final proposition, which states that given a rule-based system similar to Brill's system, one can build a subsequential transducer that represents it: Proposition If (f1, ..., fn) is a sequence of subsequential functions with bounded domains and such that fi(ε) = ∅, then RmLocExt(f1) ∘ ... ∘ RmLocExt(fn) is subsequential.</Paragraph> <Paragraph position="19"> We have proven in this section that our techniques apply to the class of transformation-based systems. We now turn our attention to the implementation of finite-state transducers. 11. Implementation of Finite-State Transducers Once the final finite-state transducer is computed, applying it to an input is straightforward: it consists of following the unique sequence of transitions whose left labels correspond to the input.
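As a toy sketch of the sparse-table idea used for the representation discussed next (row displacement in the spirit of Tarjan and Yao 1979; the layout and names are our own, not the paper's implementation):

```python
# Compress a sparse 2-D transition table by overlapping its rows into a
# single vector: each row gets a displacement, and an owner array records
# which row filled each slot, so lookup stays constant-time.
def compress(rows, n_cols):
    """rows: list of sparse rows, each a dict column -> value."""
    data, owner, disp = [], [], []
    for r, row in enumerate(rows):
        d = 0
        while True:  # find the smallest displacement with no collision
            if all(d + c >= len(data) or owner[d + c] is None for c in row):
                break
            d += 1
        need = d + n_cols
        while need > len(data):
            data.append(None)
            owner.append(None)
        for c, v in row.items():
            data[d + c], owner[d + c] = v, r
        disp.append(d)
    return data, owner, disp

def lookup(data, owner, disp, r, c):
    # Valid only if the slot was filled by row r; otherwise: no transition.
    i = disp[r] + c
    if len(data) > i and owner[i] == r:
        return data[i]
    return None
```

The owner check is what lets several rows share the same vector without confusing their entries, which is how random access survives the compression.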
However, in order to have a complexity fully independent of the size of the grammar and in particular independent of the number of transitions at each state, one should carefully choose an appropriate representation for the transducer. In our implementation, transitions can be accessed randomly. The transducer is first represented by a two-dimensional table whose rows are indexed by states and whose columns are indexed by the alphabet of all possible input letters. The content of the table at row q and column a is the word w such that the transition from q with the input label a outputs w. Since only a few transitions are allowed from many states, this table is very sparse and can be compressed. This compression is achieved while maintaining random access using a procedure for sparse data tables following the method given by Tarjan and Yao (1979).</Paragraph> </Section> </Section></Paper>