<?xml version="1.0" standalone="yes"?> <Paper uid="J04-2005"> <Title>© 2004 Association for Computational Linguistics Squibs and Discussions Comments on &quot;Incremental Construction and Maintenance of Minimal Finite-State Automata,&quot; by Rafael C. Carrasco and Mikel</Title> <Section position="4" start_page="232" end_page="234" type="intro"> <SectionTitle> 4. Analysis </SectionTitle> <Paragraph position="0"> The algorithm correctly adds new strings to the automaton while maintaining its minimality. We assume that all states in the initial automaton are in the register, that there are no pairs of states with the same right language, that all states are reachable from the initial state, and that there is a path from every state to one of the final states.</Paragraph> <Paragraph position="1"> The absorption state and the transitions that lead to it are not explicitly represented.</Paragraph> <Paragraph position="2"> To prove that the algorithm is correct, we need to show that: 1. the language of the automaton after the addition of the string contains that string; 2. no other strings are added to the automaton; 3. no strings are removed from the automaton; 4. the automaton remains minimal except for the path of the newly added string, that is, the states covered by the path of the newly added string are representatives of the only equivalence classes that may have more than one member.</Paragraph> <Paragraph position="3"> It is easy to show that strings are indeed added to the language of the automaton. First, transitions labeled with successive symbols of the string are followed from the initial state. When there is no transition with the appropriate symbol, a new one is created. The state reached with the string is made final. Minimization done by minim path replaces states with other states that have the same right language. 
That operation does not change the language of the automaton.</Paragraph> <Paragraph position="4"> If the initial state has any incoming transitions, it is cloned, and the clone becomes the new initial state. That operation does not change the language of the automaton: the right language of the new initial state is exactly the same as that of the old one. The old initial state is still reachable, because it has incoming transitions from either the new initial state (if the old initial state had a loop) or other states that are reachable. The cloning creates a new state that is not in the register and that is equivalent to another state in the automaton. Lines 14-16 of the algorithm check whether, after the addition of new strings, the new initial state is equivalent to some other state in the automaton. If it is, the new initial state is replaced with the equivalent state.</Paragraph> <Paragraph position="5"> Since the automaton is deterministic, it cannot hold more than one copy of the same string. Therefore, we need only show that no other strings are erroneously added to the automaton. Such erroneous addition could happen by creating or redirecting transitions. New transitions are created to store suffixes of new strings that are not present in the automaton. This could lead to the addition of new, superfluous strings if the states to which we add transitions were reentrant (confluence) states. However, the algorithm excludes such cases. All states in the path of the previously added string have only one incoming transition. All reentrant (confluence) states not in the longest common prefix path are cloned in line 48 of function add suffix. 
Function minim path can redirect transitions only to states not in the longest common prefix path.</Paragraph> <Paragraph position="6"> Since states that are deleted in line 33 in function minim path (the only place in the algorithm where states are deleted) are always replaced as targets of transitions by equivalent states, strings could be deleted from the automaton only by making parts of it unreachable. However, all transitions going out from a state to be deleted lead to states that have more than one incoming transition, namely the states that replaced the previous targets of those transitions. This includes the case of states with no outgoing transitions.</Paragraph> <Paragraph position="7"> To show that the automaton remains minimal except for the path of the newly added string, we first note that all existing states are in the register before we start adding new strings. Adding a new string creates a single chain of states not in the register. The chain is added in its entirety with function add suffix, as the &quot;previous&quot; string for the first string is assumed to be empty. If w is the string to be added and a prefix of w already labels a path from the initial state that does not lead to the absorption state, then non-reentrant states not following any reentrant states in that path are removed from the register, and reentrant states (and states that follow them) are cloned.</Paragraph> <Paragraph position="9"> For the remaining part of w, new states and transitions are created. This concludes forming a path for the first string. That path consists entirely of states that are not in the register and that can have an equivalent state somewhere in the rest of the automaton.</Paragraph> <Paragraph position="10"> When subsequent strings are added, they are divided into two parts by function lcp. It divides both the previous and the next string. The first part (the longest common prefix) is shared between the previous and the next string, and it remains outside the register. 
This also means that for each state in that part, there may be an equivalent state in the remaining part of the automaton. The second part of the next string will form the rest of the path of states outside the register. The second part of the path of the previous string will be subject to minimization, as no further outgoing transitions will be added to any of its states in the future. Minimization replaces those states in the path of the suffix of the previous string that are not unique with their equivalent states. Since minimization is performed from the end of the string toward the longest common prefix, we can use the register and compare the states using the recursive definition of the right language, replacing right languages of target states with their addresses. At the end of the process, we have an automaton that is minimal except for the path of the last string added to it. We are then back in the starting situation.</Paragraph> <Paragraph position="11"> The algorithm has the same asymptotic complexity as the corresponding algorithms in Carrasco and Forcada (2002) and Daciuk et al. (2000). However, it is faster than algorithms for unsorted data, because it does not have to reprocess the states over and over again. Each time the original algorithm clones a state, that state is reprocessed. Cloning in the new version is limited to the part of the automaton built before the addition of new strings. No state created by the algorithm is cloned afterward.</Paragraph> </Section> </Paper>