<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1305"> <Title>Incremental Construction of Minimal Acyclic Finite State Automata and Transducers</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> (DEPARTMENT OF COMPUTER SCIENCE) </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> RIBBIT SOFTWARE SYSTEMS INC. (IST TECHNOLOGIES RESEARCH GROUP) </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In this paper, we describe a new method for constructing minimal, deterministic, acyclic finite state automata and transducers. Traditional methods consist of two steps: the first is to construct a trie, and the second is to minimize it. Our approach is to construct an automaton in a single step by adding new strings one by one and minimizing the resulting automaton on-the-fly. We present a general algorithm as well as a specialization that relies upon the lexicographical sorting of the input strings.</Paragraph> </Section> </Section> <Section position="3" start_page="0" end_page="48" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Finite state automata are used in a variety of applications, such as natural language processing (NLP). They may store sets of words or sets of words with annotations, such as the corresponding pronunciation, lexeme, morphotactic categories, et cetera. The main reasons for the use of finite state automata in the NLP domain are their small size and very short lookup time.</Paragraph> <Paragraph position="1"> Of particular interest to the NLP community are deterministic, acyclic, finite state automata, which we call dictionaries. 
We refer to the set of all such dictionary automata as DAFSA.</Paragraph> <Paragraph position="2"> Dictionaries can be constructed in various ways, using different data. (See Watson [3, 5] for a taxonomy of (general) finite state automata construction algorithms.) A word is simply a finite sequence of symbols over some alphabet (we do not associate it with a meaning during the construction phase). For the purpose of this article, the input data is a finite sequence of words. This is a necessary and sufficient condition for any resulting deterministic automaton to be acyclic.</Paragraph> <Paragraph position="3"> The Myhill-Nerode theorem (see Hopcroft and Ullman [1]) states that among the many automata that accept a given language, there is a unique automaton (excluding isomorphisms) that has a minimal number of states. This is called the minimal automaton of the language.</Paragraph> <Paragraph position="4"> The generalized algorithm presented in this paper has been independently developed by Jan Daciuk (who is also the sole developer of the sorted specialization of the algorithm) of the Technical</Paragraph> <Paragraph position="6"> University of Gdańsk and by Richard Watson and Bruce Watson of the IST Technologies Research Group at Ribbit Software Systems Inc. Jan Daciuk has made his C++ implementations of the algorithms freely available for research purposes at www.pg.gda.pl/~jandac/fsa.html.</Paragraph> <Paragraph position="7"> Ribbit's commercial C++ and Java implementations are available via www.RibbitSoft.com.</Paragraph> <Paragraph position="8"> Ribbit's implementations include several additional features, such as a method to remove words from the dictionary (while maintaining minimality) and the ability to associate any type of annotation with a word in the dictionary (hence providing an efficient (p-)subsequential transducer implementation). 
In addition, it is possible to save a constructed dictionary and reload it on a different platform and in a different implementation language (without endianness problems). The algorithms have been used for constructing dictionaries and transducers for spell checking, morphological analysis, two-level morphology, restoration of diacritics and perfect hashing. In addition, the algorithms have proven useful in numerous problems outside the field of NLP (for example, DNA sequence matching, computer virus recognition and document indexing).</Paragraph> </Section> <Section position="4" start_page="48" end_page="48" type="metho"> <SectionTitle> 2 Mathematical Preliminaries </SectionTitle> <Paragraph position="0"> Formally, we define a deterministic finite-state automaton to be a 5-tuple M = (Q, Σ, δ, q0, F), where Q is a finite set of states, q0 ∈ Q is the start state, F ⊆ Q is a set of final states, Σ is a finite set of symbols called the alphabet and δ is a partial mapping δ : Q × Σ → Q denoting transitions. We can extend the δ mapping to δ* : Q × Σ* → Q as in Hopcroft and Ullman [1].</Paragraph> <Paragraph position="1"> We define ℒ(M) to be the language accepted by automaton M:</Paragraph> <Paragraph position="3"> We define ℒ→(q) to be the right language of a state q in M (the set of all strings, over Σ*, on a path from state q to any final state of M using the extended transition relation δ*):</Paragraph> <Paragraph position="5"> Note that ℒ(M) = ℒ→(q0). 
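The two set definitions referred to above can be written out as follows. This is a reconstruction from the prose, assuming the standard reading of the extended transition function δ*; it is offered as a sketch and not as the authors' typeset formulas:

```latex
\begin{gather*}
\mathcal{L}(M) = \{\, w \in \Sigma^{*} \mid \delta^{*}(q_0, w) \in F \,\} \\
\overrightarrow{\mathcal{L}}(q) = \{\, w \in \Sigma^{*} \mid \delta^{*}(q, w) \in F \,\}
\end{gather*}
```

With q = q0 the second set coincides with the first, which is the identity just noted.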
We also define a property of an automaton specifying that all states can be reached from the start state:</Paragraph> <Paragraph position="7"> The property of being a minimal automaton is traditionally defined as follows (see Watson [3, 5]):</Paragraph> <Paragraph position="9"> We will, however, use an alternative definition of minimality, which is shown to be equivalent (see Watson [3, 5]):</Paragraph> <Paragraph position="11"> whose language is the French regular endings of verbs of the first group.</Paragraph> </Section> <Section position="5" start_page="48" end_page="52" type="metho"> <SectionTitle> 3 Construction from Sorted Data </SectionTitle> <Paragraph position="0"> A trie is a dictionary with a transition graph that is a tree with the start state as the root and all leaves being final. Let us picture a dictionary in the form of a trie (for example, see Figure 1).</Paragraph> <Paragraph position="1"> We can see that many subtrees in the transition graph are isomorphic. The equivalent minimal dictionary (Figure 2) is the one in which, for all isomorphic subtrees, only one copy of the tree is kept. That is, pointers (edges) to all isomorphic subtrees are replaced by pointers (edges) to their unique copy.</Paragraph> <Paragraph position="2"> Traditionally, to obtain a minimal dictionary one would first create a dictionary for the language (not necessarily minimal), and then minimize it using any one of a number of algorithms (see Watson [4, 5]). Usually, the first stage is done by building a trie, for which there are fast and well understood algorithms. Although algorithms that minimize dictionaries can be fairly effective in their use of memory, they unfortunately have poor run-time performance. In addition, the size of the original dictionary can be enormous, although some efforts towards decreasing its memory requirements have been reported (see Revuz [2]). 
This paper presents a way to reduce these intermediate memory requirements and decrease the total construction time by constructing the minimal dictionary incrementally (word by word, maintaining an invariant of minimality), thus avoiding ever having a trie in memory.</Paragraph> <Paragraph position="3"> The central part of most automata minimization algorithms is a classification of states (see Watson [4, 5]). The states of a dictionary are partitioned into equivalence classes of which the representatives are the states of the minimal dictionary. Assuming the original dictionary does not have any useless states (that is, Useful(M) is true), we can deduce (by our alternative definition of minimality) that each state in the minimal dictionary must have a unique right language. Since this is a necessary and sufficient condition for minimality, we can use equality of right languages as our equivalence relation for our classes (see Watson [3, 5]). Using our definition of right languages, it is easily shown that equality of right languages is an equivalence relation (reflexive, symmetric and transitive). We will denote two states, p and q, belonging to the same equivalence class by p ≡ q (note that ≡ here is different from its use for logical equivalence of predicates).</Paragraph> <Paragraph position="4"> Let us step through the minimization of the trie in Figure 1 using the algorithm given in Hopcroft and Ullman [1] and Watson [5]. As a first step, pairs of states where one is final and</Paragraph> <Paragraph position="6"> the other is not can immediately be marked as belonging to different equivalence classes (since only final states contain ε, the empty string, in their right language). Pairs of states that have a different number of outgoing transitions, or the same number but with different labels, can also be marked as belonging to different equivalence classes. 
Finally, pairs of states that have transitions labeled with the same symbols but leading to different states that have already been considered can be marked as belonging to different equivalence classes.</Paragraph> <Paragraph position="7"> Let us traverse the trie (see Figure 1) with the postorder method and see how the partition can be performed. We start with the (lexicographically) first leaf, moving backward through the trie toward the start state. All states up to the first forward-branching state (a state with more than one outgoing transition) must belong to different classes. We can put them into a register of states so that we can find them easily. There will be no need to replace them by other states. Considering the other branches, and starting from their leaves, we need to know whether or not a given state belongs to the same class as a previously registered state. The state being considered belongs to the same class as a representative of an established class if and only if: 1. they are either both final or both non-final (if there is an annotation or some other type of information associated with each state, then states in the same equivalence class must all have equivalent information); 2. they have the same number of outgoing transitions; 3. corresponding transitions have the same labels; 4. corresponding transitions lead to the same states; and 5. states reachable via outgoing transitions are the sole representatives of their classes. The last condition is satisfied by using the postorder method to traverse the trie. If all the conditions are satisfied, the state is replaced by the equivalent (representative) state found in the register. Replacing a state simply involves deleting the state while redirecting all of its in-transitions to the equivalent state. Note that all leaf states belong to the same equivalence class. 
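The five conditions above can be sketched as a small Python predicate. This is a hedged illustration, not the authors' code: the State class, its field names, and the list-based register are assumptions made for the example, annotations are left out, and condition 5 is taken to hold already (as the postorder traversal guarantees), so targets are compared by identity.

```python
class State:
    """Dictionary state; this class and its fields are illustrative assumptions."""

    def __init__(self, final=False, edges=None):
        self.final = final
        self.edges = dict(edges or {})   # label -> target State


def equivalent(p, q):
    """Conditions 1-4 from the text; condition 5 is assumed to hold already."""
    if p.final != q.final:                      # 1. same finality
        return False
    if len(p.edges) != len(q.edges):            # 2. same number of transitions
        return False
    for label, target in p.edges.items():
        if label not in q.edges:                # 3. same labels
            return False
        if q.edges[label] is not target:        # 4. same target states
            return False
    return True


def find_or_register(register, state):
    """Return the registered representative of state, registering it if new."""
    for representative in register:
        if equivalent(representative, state):
            return representative
    register.append(state)
    return state
```

The linear scan keeps the sketch short; a practical register would presumably be a hash table or a search tree keyed on the same four properties.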
If some of the conditions are not satisfied, the state must be a representative of a new class and therefore must be put into the register.</Paragraph> <Paragraph position="8"> In order to build the dictionary one word at a time, we need to merge the process of adding new words to the dictionary with the minimization process. There are two crucial questions that need to be answered. Firstly, which states (or equivalence classes) are subject to change when new words are added? Secondly, is there a way to add new words to the dictionary such that we minimize the number of states that may need to be changed during the addition of a word? Looking at Figures 1 and 2, it becomes clear that in order to reproduce the same postorder traversal of states, the input data must be lexicographically sorted. (Note that in order to do this, the alphabet Σ must be ordered.) Further investigation reveals that when we add words in this order, only the states that need to be traversed to accept the previous word added to the dictionary may change when a new word is added. All the rest of the dictionary remains unchanged. This discovery leads us to the algorithm shown in Algorithm 3.1.</Paragraph> <Paragraph position="10"> The function common_prefix finds the longest prefix (of the word to be added) that is a prefix of a word already in the automaton.</Paragraph> <Paragraph position="11"> The function add_suffix creates a branch extending out of the dictionary, which represents the suffix of the word being added (the maximal suffix of the word which is not a prefix of any other word already in the dictionary). The last state of this branch is marked as final (and an annotation is associated with it, if applicable). The function last_child returns a (modifiable) reference to the state reached by the lexicographically last transition that is outgoing from the argument state. 
Since the input data is lexicographically sorted, last_child returns the outgoing transition (from the state) most recently added (during the addition of the previous word).</Paragraph> <Paragraph position="12"> To determine which states have already been processed, each state has a marker that indicates whether or not it is already registered. Some parts of the automaton are left for further treatment (replacement or registering) until some other word is added so that those states no longer belong to the path in the automaton that accepts the new word. That marker is read with marked_as_registered and set with mark_as_registered. Finally, has_children returns true if, and only if, there are outgoing transitions from the state, and delete_branch deletes its argument state and all states that can be reached from it (if they are not already marked as registered).</Paragraph> <Paragraph position="13"> Memory is needed for the minimized dictionary that is under construction, the call stack and for the register of states. The memory for the dictionary is proportional to the number of states and the total number of transitions. The memory for the register of states is proportional to the number of states and can be freed once construction is complete. Depending upon the choice of implementation method, memory may be required to maintain the equivalence relation.</Paragraph> <Paragraph position="14"> The main loop of the algorithm runs m times, where m is the number of words to be accepted by the dictionary. The function common_prefix executes in O(|w|) time, where |w| is the maximum word length. The function replace_or_register executes recursively at most |w| times for each word. In each recursive call, there is one register search and possibly one register insertion. The pessimistic time complexity of the search is O(log n), where n is the number of states in the (minimized) dictionary. 
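The functions described above can be combined as in the following Python sketch of the sorted-data construction. It is an illustration under assumptions, not the authors' implementation: the State class, its field names, build, accepts and count_states are hypothetical, the register is a Python dict (a hash table), and the explicit registered marker and delete_branch are omitted because replaced states are simply dropped and reclaimed by the garbage collector.

```python
class State:
    """Dictionary state; this class and its fields are illustrative assumptions."""

    def __init__(self):
        self.final = False
        self.edges = {}   # label -> target State

    def key(self):
        # Finality plus labelled targets. Targets are compared by identity,
        # which is sound because they are already registered representatives.
        return (self.final, tuple(sorted((c, id(t)) for c, t in self.edges.items())))


def common_prefix(root, word):
    """Follow word as far as possible; return the last state and symbols used."""
    state, i = root, 0
    while i != len(word) and word[i] in state.edges:
        state = state.edges[word[i]]
        i += 1
    return state, i


def add_suffix(state, suffix):
    """Append a fresh branch spelling suffix and mark its last state final."""
    for symbol in suffix:
        nxt = State()
        state.edges[symbol] = nxt
        state = nxt
    state.final = True


def replace_or_register(state, register):
    label = max(state.edges)            # last_child: lexicographically last edge
    child = state.edges[label]
    if child.edges:                     # has_children
        replace_or_register(child, register)
    # Reuse an equivalent registered state, or register the child itself.
    state.edges[label] = register.setdefault(child.key(), child)


def build(sorted_words):
    """Build a minimal dictionary from lexicographically sorted words."""
    root, register = State(), {}
    for word in sorted_words:
        last, used = common_prefix(root, word)
        if last.edges:
            replace_or_register(last, register)
        add_suffix(last, word[used:])
    if root.edges:
        replace_or_register(root, register)
    return root


def accepts(root, word):
    state, used = common_prefix(root, word)
    return used == len(word) and state.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.edges.values())
    return len(seen)
```

For example, build(["tap", "taps", "top", "tops"]) shares the states after "ta" and "to", so the result has five states where the corresponding trie has eight.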
The pessimistic time complexity of adding a state to the register is also O(log n). By using a hash table to represent the register (and equivalence relation), the average time complexity of those operations can be made constant. Since all children of a state are either replaced or registered, delete_branch executes in constant time. So the pessimistic time complexity of the entire algorithm is O(m|w| log n), while an average time complexity of O(m|w|) can be achieved.</Paragraph> </Section> <Section position="6" start_page="52" end_page="53" type="metho"> <SectionTitle> 4 Construction from Unsorted Data </SectionTitle> <Paragraph position="0"> Sometimes it is difficult or impossible to sort the input data before constructing a dictionary, for example, when there is not enough time or storage space to sort the data, or when the data originates in another program or physical source. An incremental dictionary-building algorithm would still be very useful in those situations, although unsorted data makes it more difficult to merge the trie-building process and the minimization process. We could leave the two processes disjoint, although this would lead to the traditional method of constructing a trie and minimizing it afterwards. A better solution is to minimize everything on-the-fly, possibly changing a state's equivalence class each time a word is added. Before actually constructing a new state in the dictionary, we first determine if it would be included in the equivalence class of a pre-existing state. In addition, we may need to change the equivalence classes of previously constructed states since their right languages may have changed. This leads to an incremental construction algorithm. 
Naturally, we would want to create the states for a new word in an order that would minimize the computation of the new equivalence classes.</Paragraph> <Paragraph position="1"> Similar to the algorithm for sorted data, when a new word is added, we search for the common prefix in the dictionary. This time, however, we cannot assume that the states traversed by this common prefix will not be changed by the addition of the word. If there are any pre-existing states traversed by the common prefix that are already targets of more than one in-transition (known as confluence states), then blindly appending another transition to the last state in this path (as we would in the sorted algorithm) would accidentally add more words than desired (see Figure 3 for an example of this).</Paragraph> <Paragraph position="2"> The middle dictionary inadvertently contains abe as well. The rightmost dictionary is correct: state 3 had to be cloned.</Paragraph> <Paragraph position="4"> To avoid generation of such spurious words, all states in the common prefix from the first state that has more than one in-transition must be cloned. Cloning is the process of creating a new state that has outgoing transitions on the same labels and to the same destination states as a given state. If we compare the minimal dictionary to an equivalent trie, we notice that a confluence state can be seen as a root of several original, isomorphic subtrees merged into one (as described in the previous section). One of the isomorphisms now needs to be modified, so it must first be separated from the others by cloning its root. The isomorphic subtrees hanging off these roots are unchanged, so the original root and its clone have the same outgoing transitions (that is, transitions on the same labels and to the same destination states).</Paragraph> <Paragraph position="5"> Once the entire common prefix is traversed, possibly cloning states along the way, the rest of the word must be appended. 
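The traversal with cloning can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: the State class, the in_degree counter, and the helper names link, clone, walk_common_prefix and accepts are hypothetical, and the bookkeeping for the register (removing and re-registering states on the path) is deliberately left out.

```python
class State:
    """Dictionary state; the class and its fields are illustrative assumptions."""

    def __init__(self, final=False):
        self.final = final
        self.edges = {}       # label -> target State
        self.in_degree = 0    # number of incoming transitions


def link(src, label, dst):
    """Point src's transition on label at dst, keeping in-degree counts current."""
    old = src.edges.get(label)
    if old is not None:
        old.in_degree -= 1
    src.edges[label] = dst
    dst.in_degree += 1


def clone(state):
    """New state with the same finality and the same outgoing transitions."""
    copy = State(state.final)
    for label, target in state.edges.items():
        link(copy, label, target)
    return copy


def walk_common_prefix(start, word):
    """Follow word from start, cloning from the first confluence state onward.

    Returns the last state reached and the number of symbols consumed.
    """
    state, used, cloning = start, 0, False
    for label in word:
        nxt = state.edges.get(label)
        if nxt is None:
            break
        if cloning or nxt.in_degree > 1:
            cloning = True
            nxt = clone(nxt)
            link(state, label, nxt)   # redirect this path to the private copy
        state = nxt
        used += 1
    return state, used


def accepts(start, word):
    state = start
    for label in word:
        state = state.edges.get(label)
        if state is None:
            return False
    return state.final
```

Applied to a dictionary holding abd and bad in which both paths share the state reached after two symbols, walking bae clones that shared state, so appending the final e afterwards does not make abe acceptable.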
If there are no confluence states in the common prefix, then the method of adding the rest of the word does not differ from the method used in the algorithm for sorted data. The addition of words in lexicographical order in the sorted algorithm ensures that we will not encounter any confluence states during the traversal of the common prefix.</Paragraph> <Paragraph position="6"> When the process of traversing the common prefix (up to a confluence state) and adding the suffix is complete, further modifications follow. We must recalculate the equivalence class of each state on the path of the new word. If any equivalence class changes, we must also recalculate the equivalence classes of all of the parents of all of the states in the changed class. Interestingly, this process could actually make the new minimal dictionary smaller. For example, if we add the word abe to the dictionary at the right of Figure 3 while maintaining minimality, we obtain the dictionary shown in the middle of Figure 3, which is one state smaller. The resulting algorithm</Paragraph> <Paragraph position="8"> Several changes to the functions used in the sorted algorithm are necessary to handle the general case of unsorted data. The replace_or_register procedure needs to be modified slightly.</Paragraph> <Paragraph position="9"> Since new words are added in arbitrary order, one can no longer assume that the last child (lexicographically) of the state (the one that has been added most recently) is the child whose equivalence class may have changed. Now, all children of a state must be checked, not only the most recently altered child. However, at most one child may need treatment, so the execution time is of the same order. Also, in the sorted algorithm, add_suffix is never passed ε as an argument, whereas this may occur in the unsorted version of the algorithm. 
The effect is that the LastState should be marked as final since the common prefix is, in fact, the entire word.</Paragraph> <Paragraph position="10"> Finally, the new function first_state simply traverses the dictionary using the given word prefix and returns the first confluence state it encounters. If no such state exists, first_state returns O.</Paragraph> <Paragraph position="11"> As in the sorted case, the main loop of the unsorted algorithm executes m times, where m is the number of words accepted by the dictionary. The inner loops are executed at most |w| times for each word. Putting a state into the register takes O(log n), although it may be constant when using a hash table. The same estimation is valid for a removal from the register.</Paragraph> <Paragraph position="12"> So the time complexity of the algorithm remains the same, but the constant changes. Similarly, hashing can be used to provide an efficient method of determining the state equivalence classes.</Paragraph> <Paragraph position="13"> For sorted data, only a single path through the dictionary could possibly be changed each time a new word is added. For unsorted data, however, the changes frequently fan out and percolate all the way back to the start state, so processing each word takes more time.</Paragraph> <Paragraph position="14"> An algorithm described by Revuz [2] also constructs a dictionary from sorted data while performing a partial minimization on-the-fly. Data is sorted in reverse order and that property is used to compress the endings of words within the dictionary as it is being built. The minimization still involves finding an equivalence relation over all of the states of the pseudo-minimal dictionary. However, the time complexity of the subset construction minimization can be reduced somewhat by using knowledge of the pseudo-minimization process. 
Although this pseudo-minimization technique is more economical in its use of memory than traditional techniques, we are still left with a sub-minimal dictionary which can be 8 times larger ([2], the DELAF dictionary) than the equivalent minimal dictionary.</Paragraph> <Paragraph position="15"> This new algorithm can also be used to construct transducers. The alphabet of the (transducing) automaton would be Σ1 × Σ2, where Σ1 and Σ2 are the alphabets of the levels. Alternatively, as previously described, elements of Σ2* can be associated with the final states of the dictionary and only output once a valid word from Σ1* is recognized.</Paragraph> </Section> <Section position="7" start_page="53" end_page="53" type="metho"> <SectionTitle> 5 Conclusions </SectionTitle> <Paragraph position="0"> We have presented two new methods for constructing minimal, deterministic, acyclic finite state automata whose languages are word sets (possibly with corresponding annotations). Both can be used to construct transducers as well as traditional acceptors. Their main advantage is their extremely low intermediate memory requirements, which are achieved by building and minimizing the dictionaries incrementally. The total construction time of these minimal dictionaries is dramatically reduced compared with previous algorithms. The algorithm constructing a dictionary from sorted data can be used in parallel with other algorithms that traverse or utilize the dictionary, since parts of the dictionary that are already constructed are no longer subject to future change.</Paragraph> </Section> </Paper>