<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1065">
  <Title>FSA: An Efficient and Flexible C++ Toolkit for Finite State Automata Using On-Demand Computation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Finite-State Automata
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Weighted Finite-State Transducer
</SectionTitle>
      <Paragraph position="0"> The basic theory of weighted finite-state automata has been reviewed in numerous papers (Mohri, 1997; Allauzen et al., 2003). We will introduce the notation briefly.</Paragraph>
      <Paragraph position="1"> A semiring (K;';&gt;;0;1) is a structure with a set K and two binary operations ' and &gt; such that (K;';0) is a commutative monoid, (K;&gt;;1) is a monoid and &gt; distributes over ' and 0 &gt;</Paragraph>
      <Paragraph position="3"> also associate the term weights with the elements of a semiring. Semirings that are frequently used in speech recognition are the positive real semiring (IR[f!1;+1g;'log;+;+1;0) with a'log b = !log(e!a + e!b) and the tropical semiring (IR[f!1;+1g;min;+;+1;0) representing the well-known sum and maximum weighted path criteria. null A weighted finite-state transducer (Q;S [ f+g;Ohm [ f+g;K;E;i;F;,;%0) is a structure with a set Q of states1, an alphabet S of input symbols, an alphabet Ohm of output symbols, a weight semiring K (we assume it k-closed here for some algorithms as described in (Mohri and Riley, 2001)), a set E QPS(S[f+g)PS(Ohm[f+g)PSK PSQ of arcs, a single initial state i with weight , and a set of final states F weighted by the function %0 : F ! K: To simplify the notation we will also denote with QT and ET the set of states and arcs of a transducer T: A weighted finite-state acceptor is simply a weighted finite-state transducer without the output alphabet.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Composition
</SectionTitle>
      <Paragraph position="0"> As we will refer to this example throughout the paper we shortly review the composition algorithm here. Let T1 : S/PSOhm/ ! K and T2 : Ohm/PSG/ ! K be two transducers defined over the same semiring K. Their composition T1 -T2 realizes the function T : S/PSG/ ! K and the theory has been described in detail in (Pereira and Riley, 1996).</Paragraph>
      <Paragraph position="1"> For simplification purposes, let us assume that the input automata are +-free and S = (Q1PSQ2;^;! ;empty) is a stack of state tuples of T1 and T2 with push, pop and empty test operations. A non lazy version of composition is shown in Figure 1.</Paragraph>
      <Paragraph position="2"> Composition of automata containing + labels is more complex and can be solved by using an intermediate filter transducer that also has been described in (Pereira and Riley, 1996).</Paragraph>
      <Paragraph position="3"> 1we do not restrict this to be a finite set as most algorithms of the lazy implementation presented in this paper also support a virtually infinite set</Paragraph>
      <Paragraph position="5"> What we can see from the pseudo-code above is that composition uses tuples of states of the two input transducers to describe states of the target transducer. Other operations defined on weighted finite-state automata use different abstract states. For example transducer determinization (Mohri, 1997) uses a set of pairs of states and weights. However, it is more convenient to use integers as state indices for an implementation. Therefore algorithms usually maintain a mapping from abstract states to integer state indices. This mapping has linear memory requirements of O(jQTj) which is quite attractive, but that depends on the structure of the abstract states. Especially in case of determinization where the size of an abstract state may vary, the complexity is no longer linear in general.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Local Algorithms
</SectionTitle>
      <Paragraph position="0"> Mohri and colleagues pointed out (Mohri et al., 2000b) that a special class of transducer algorithms can be computed on demand. We will give a more detailed analysis here. We focus on algorithms that produce a single transducer and refer to them as algorithmic transducers.</Paragraph>
      <Paragraph position="1"> Definition: Let be the input configuration of an algorithm A( ) that outputs a single finite-state transducer T: Additionally, let M : S ! QT be a one-to-one mapping from the set of abstract state descriptions S that A generates onto the set of states of T. We call A local iff for all states s 2 QT A can generate a state s of T and all outgoing arcs (s;i;o;w;s0) 2 ET; depending only on its abstract state M!1(s) and the input configuration : With the preceding definition it is quite easy to prove the following lemma: Lemma: An algorithm A that has the local prop-erty can be built on demand starting with the initial state iTA of its associated algorithmic transducer TA: Proof: For the proof it is sufficient to show that we can generate and therefore reach all states of TA: Let S be a stack of states of TA that we still have to process. Due to the one-to-one mapping M we can map each state of TA back to an abstract state of A: By definition the abstract state is sufficient to generate the complete state and its outgoing arcs.</Paragraph>
      <Paragraph position="2"> We then push those target states of all outgoing arcs onto the stack S that have not yet been processed.</Paragraph>
      <Paragraph position="3"> As TA is finite the traversal ends after all states of TA as been processed exactly once. 2 Algorithmic transducers that can be computed on-demand are also called lazy or virtual transducers. Note, that due to the local property the set of states does not necessarily be finite anymore.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The Toolkit
</SectionTitle>
    <Paragraph position="0"> The current implementation is the second version of this toolkit. For the first version - which was called FSM - we opted for using C++ templates to gain efficiency, but algorithms were not lazy. It turned out that the implementation was fast, but many operations wasted a lot of memory as their resulting transducer had been fully expanded in memory. However, we plan to also make this initial version publically available.</Paragraph>
    <Paragraph position="1"> The design principles of the second version of the toolkit, which we will call FSA, are: + decoupling of data structures and algorithms, + on-demand computation for increased memory efficiency, + low computational costs, + an abstract interface to alphabets to support lazy mappings from strings to indices for arc labels, + an abstract interface to semirings (should be k-closed for at least some algorithms), + implementation in C++, as it is fast, ubiquitous and well-known by many other researchers, + easy to use interfaces.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The C++ Library Implementation
</SectionTitle>
      <Paragraph position="0"> We use the lemma from Section 2.3 to specify an interface for lazy algorithmic transducers directly.</Paragraph>
      <Paragraph position="1"> The code written in pseudo-C++ is given in Figure 2. Note that all lazy algorithmic transducers are derived from the class Automaton.</Paragraph>
      <Paragraph position="2"> The lazy interface also has disadvantages. The virtual access to the data structure might slow computations down, and obtaining global information about the automaton becomes more complicated.</Paragraph>
      <Paragraph position="3"> For example the size of an automaton can only be  stract datatype of transducers. Note that R&lt;T&gt; refers to a smart pointer of T.</Paragraph>
      <Paragraph position="4"> computed by traversing it. Therefore central algorithms of the RWTH FSA toolkit are the depth-first search (DFS) and the computation of strongly connected components (SCC). Efficient versions of these algorithms are described in (Mehlhorn, 1984) and (Cormen et al., 1990).</Paragraph>
      <Paragraph position="5"> It is very costly to store arbitrary types as arc labels within the arcs itself. Therefore the RWTH FSA toolkit offers alphabets that define mappings between strings and label indices. Alphabets are implemented using the abstract interface shown in Figure 4. With alphabets arcs only need to store the abstract label indices. The interface for alphabets is defined using a single constant: for each label index an alphabet reports it must ensure to always deliver the same symbol on request through getSymbol().</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Algorithms
</SectionTitle>
      <Paragraph position="0"> The current implementation of the toolkit offers a wide range of well-known algorithms defined on weighted finite-state transducers: + basic operations sort (by input labels, output labels or by to-</Paragraph>
      <Paragraph position="2"> are used to gain maximum efficiency. Mapping of arc labels is necessary as symbol indices may differ between alphabets.</Paragraph>
      <Paragraph position="3"> tal arc), map-input and -output labels symbolically (as the user expects that two alphabets match symbolically, but their mapping to label indices may differ), cache (helps to reduce computations with lazy implementations), topologically-sort states + rational operations project-input, project-output, transpose (also known as reversal: calculates an equivalent automaton with the adjacency matrix being transposed), union, concat, invert + classical graph operations depth-first search (DFS), single-source shortest path (SSSP), connect (only keep accessible and coaccessible state), strongly connected  prune (based on forward/backward state potentials), posterior, push (push weights toward initial/final states), failure (given an acceptor/transducer defined over the tropical semiring converts +-transitions to failure transitions)  + diagnostic operations count (counts states, final states, different arc types, SCCs, alphabet sizes, :::) + input/output operations supported input and/or output formats are: AT&amp;T (currently, ASCII only), binary (fast,  uses fixed byte-order), XML (slower, any encoding, fully portable), memory-mapped (also on-demand), dot (AT&amp;T graphviz) We will discuss some details and refer to the publication of the algorithms briefly. Most of the basic operations have a straigthforward implementation. As arc labels are integers in the implementation and their meaning is bound to an appropriate symbolic alphabet, there is the need for symbolic mapping between different alphabets. Therefore the toolkit provides the lazy map-input and map-output transducers, which map the input and output arc indices of an automaton to be compatible with the indices of another given alphabet.</Paragraph>
      <Paragraph position="4"> The implementations of all classical graph algorithms are based on the descriptions of (Mehlhorn, 1984) and (Cormen et al., 1990) and (Mohri and Riley, 2001) for SSSP. The general graph algorithms DFS and SCC are helpful in the realisation of many other operations, examples are: transpose, connect and count. However, counting the number of states of an automaton or the number of symbols of an alphabet is not well-defined in case of an infinite set of states or symbols.</Paragraph>
      <Paragraph position="5"> SSSP and transpose are the only two algorithms without a lazy implementation. The result of SSSP is a list of state potentials (see also (Mohri and Riley, 2001)). And a lazy implementation for transpose would be possible if the data structures provide lists of both successor and predecessor arcs at each state. This needs either more memory or more computations and increases the size of the abstract interface for the lazy algorithms, so as a compromise we omitted this.</Paragraph>
      <Paragraph position="6"> The implementations of compose (Pereira and Riley, 1996), determinize (Mohri, 1997), minimize (Mohri, 1997) and remove-epsilons (Mohri, 2001) use more refined methods to gain efficiency. All use at least the lazy cache transducer as they refer to states of the input transducer(s) more than once. With respect to the number of lazy transducers involved in computing the result, compose has the most complicated implementation. Given the implementations for the algorithmic transducers cache, map-output, sort-input, sort-output and simple-compose that assumes arc labels to be compatible and sorted in order to perform matching as fast as possible, the final implementation of compose in the RWTH FSA toolkit is given in figure 3. So, the current implementation of compose uses 6 algorithmic transducers in addition to the two input automata. Determinize additionally uses lazy cache and sort-input transducers.</Paragraph>
      <Paragraph position="7"> The search algorithms best and n-best are based on (Mohri and Riley, 2002), push is based on (Mohri and Riley, 2001) and failure mainly uses ideas from (Allauzen et al., 2003). The algorithms posterior and prune compute arc posterior probabilities and prune arcs with respect to them. We believe they are standard algorithms defined on probabilistic networks and they were simply ported to the framework of weighted finite-state automata.</Paragraph>
      <Paragraph position="8"> Finally, the RWTH FSA toolkit can be loosely interfaced to the AT&amp;T FSM LibraryTM through its ASCII-based input/output format. In addition, a new XML-based file format primarly designed as being human readable and a fast binary file format are also supported. All file formats support optional on-the-fly compression using gzip.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 High-Level Interfaces
</SectionTitle>
      <Paragraph position="0"> In addition to the C++ library level interface the toolkit also offers two high-level interfaces: a Python interface, and an interactive command-line interface.</Paragraph>
      <Paragraph position="1"> The Python interface has been built using the SWIG interface generator (Beazley et al., 1996) and enables rapid development of larger applications without lengthy compilation of C++ code. The command-line interface comes handy for quickly applying various combinations of algorithms to transducers without writing any line of code at all. As the Python interface is mainly identical to the C++ interface we will only give a short impression of how to use the command-line interface.</Paragraph>
      <Paragraph position="2"> The command-line interface is a single executable and uses a stack-based execution model (postfix notation) for the application of operations. This is different from the pipe model that AT&amp;T command-line tools use. The disadvantage of using pipes is that automata must be serialized and get fully expanded by the next executable in chain.</Paragraph>
      <Paragraph position="3"> However, an advantage of multiple executables is that memory does not get fragmented through the interaction of different algorithms.</Paragraph>
      <Paragraph position="4"> With the command-line interface, operations are applied to the topmost transducers of the stack and the results are pushed back onto the stack again. For example, &gt; fsa A B compose determinize draw readsAandBfrom files, calculates the determinized composition and writes the resulting automaton to the terminal in dot format (which may be piped to dot directly). As you can see from the examples some operations like write or draw take additional arguments that must follow the name of the operation. Although this does not follow the strict postfix design, we found it more convenient as these parameters are not automata.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Comparison of Toolkits
</SectionTitle>
      <Paragraph position="0"> A crucial aspect of an FSA toolkit is its computational and memory efficiency. In this section we will compare the efficiency of four different implementations of weighted-finite state toolkits, namely:  We opted to not evaluate the FSA6.1 from (van Noord, 2000) as we found that it is not easy to install and it seemed to be significantly slower than any of the other implementations. RWTH FSA and the AT&amp;T FSM LibraryTM use on-demand computations whereas FSM and WFST do not. As the algorithmic code between RWTH FSA and its predecessor RWTH FSM has not changed much except for the interface of lazy transducers, we can also compare lazy versus non lazy implementation.</Paragraph>
      <Paragraph position="1"> Nevertheless, this direct comparison is also possible with RWTH FSA as it provides a static storage class transducer and a traversing deep copy operation.</Paragraph>
      <Paragraph position="2"> Table 1 summarizes the tasks used for the evaluation of efficiency together with the sizes of the resulting transducers. The exact meaning of the different transducers is out of scope of this comparison. We simply focus on measuring the efficiency of the algorithms. Experiment 1 is the full expansion of the static part of a speech recognition search network. Experiment 2 deals with a translation problem and splits words of a &amp;quot;bilanguage&amp;quot; into single words. The meaning of the transducers used for Experiment 2 will be described in detail in Section 4.2. Experiment 3 is similar to Experiment 1 except for that the grammar transducer is exchanged with a translation transducer and the result represents the static network for a speech-to-text translation system. null  ory using Linux as operating system. Table 2 summarizes the peak memory usage of the different toolkit implementations for the given tasks and Table 3 shows the CPU usage accordingly.</Paragraph>
      <Paragraph position="3"> As can be seen from Tables 2 and 3 for all given tasks the RWTH FSA toolkit uses less memory and computational power than any of the other toolkits. However, it is unclear to the authors why the AT&amp;T LibraryTM is a factor of 1800 slower for experiment 2. The numbers also do not change much after additionally connecting the composition result (as in RWTH FSA compose does not connect the result by default): memory usage rises to 62 MB and execution time increases to 9.7 seconds. However, a detailed analysis for the RWTH FSA toolkit has shown that the composition task of experiment 2 makes intense use of the lazy cache transducer due to the loop character of the two transducers C1 and C2: It can also be seen from the two tables that the lazy implementation RWTH FSA uses significantly less memory than the non lazy implementation RWTH FSM and less than half of the CPU time. One explanation for this is the poor memory management of RWTH FSM as all intermediate results need to be fully expanded in memory. In contrast, due to its lazy transducer interface, RWTH FSA may allocate memory for a state only once and reuse it for all subsequent calls to the getState() method.</Paragraph>
      <Paragraph position="4">  cluding I/O using a 1.2GHz AMD Athlon processor (/ exceeded memory limits: given time indicates point of abortion).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Statistical Machine Translation
</SectionTitle>
      <Paragraph position="0"> Statistical machine translation may be viewed as a weighted language transduction problem (Vidal, 1997). Therefore it is fairly easy to build a machine translation system with the use of weighted finite-state transducers.</Paragraph>
      <Paragraph position="1"> Let fJ1 and eIi be two sentences from a source and target language respectively. Also assume that we have word level alignments A of all sentences from a bilingual training corpus. We denote with epJp1 the segmentation of a target sentence eI1 into phrases such that fJ1 and epJp1 can be aligned monotoneously. This segmentation can be directly calculated from the alignments A: Then we can formulate the problem of finding the best translation ^eI1 of a source sentence as follows:  The last line suggests to solve the translation problem by estimating a language model on a bilanguage (see also (Bangalore and Riccardi, 2000; Casacuberta et al., 2001)). An example of sentences from this bilanguage is given in Figure 5 for the translation task Vermobil (German ! English). For technical reasons, +-labels are represented by a $ symbol. Note, that due to the fixed segmentation given by the alignments, phrases in the target language are moved to the last source word of an alignment block.</Paragraph>
      <Paragraph position="2"> So, given an appropriate alignment which can be obtained by means of the pubically available  GIZA++ toolkit (Och and Ney, 2000), the approach is very easy in practice: 1. Transform the training corpus with a given alignment into the corresponding bilingual corpus null 2. Train a language model on the bilingual corpus 3. Build an acceptor A from the language model  The symbols of the resulting acceptor are still a mixture of words from the source language and phrases from the target language. So, we additionally use two simple transducers to split these bilingual words (C1 maps source words fj to bilingual words that start with fj and C2 maps bilingual words with the target sequence epj to the sequences of target words the phrase was made of): 4. Split the bilingual phrases of A into single words:</Paragraph>
      <Paragraph position="4"> Then the translation problem from above can be rewritten using finite-state terminology, as sketched below.</Paragraph>
      [Figure 5: Example sentence from the German-English bilanguage; $\varepsilon$-labels are written as $: dann|$ melde|$ ich|I_am_calling mich|$ noch|$ einmal|once_more .|.]
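      The finite-state formulation itself was also dropped during extraction; using the toolkit operations of Section 3.2 and the hypothetical $\tilde{A}$ from above, it plausibly reads:

        \hat{e}_1^I = \operatorname{project\text{-}output}\bigl(\operatorname{best}(f_1^J \circ \tilde{A})\bigr)

      where the source sentence $f_1^J$ is interpreted as a linear acceptor.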
      <Paragraph position="6"> Translation results using this approach are summarized in Table 4 and compared with results obtained using the alignment template approach (Och and Ney, 2000). Results for both approaches were obtained using the same training corpus alignments. Detailed task descriptions for Eutrans/FUB and Verbmobil can be found in (Casacuberta et al., 2001) and (Zens et al., 2002), respectively. We use the usual definitions of word error rate (WER), position-independent word error rate (PER) and BLEU statistics here.</Paragraph>
      <Paragraph position="7"> For the simpler tasks Eutrans, FUB and PF-Star, the WER, PER and inverted BLEU statistics are close for both approaches. On the German-to-English Verbmobil task, the FSA approach suffers from long-distance reorderings (captured only through the fixed training corpus segmentation), which is not very surprising.</Paragraph>
      <Paragraph position="8"> Although we do not have comparable numbers for the memory usage and the translation times of the alignment template approach, the resource usage of the finite-state approach is quite remarkable, as we only use generic methods from the RWTH FSA toolkit and full search (i.e. we do not prune the search space). However, informal tests have shown that the finite-state approach uses much less memory and computation than the current implementation of the alignment template approach.</Paragraph>
      <Paragraph position="9"> Two additional advantages of finite-state methods for translation in general are: the input to the search algorithm may also be a word lattice and it is easy to combine speech recognition with translation in order to do speech-to-speech translation.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Summary
</SectionTitle>
    <Paragraph position="0"> In this paper we have given a characterization of algorithms that produce a single finite-state automaton and bear an on-demand implementation. For this purpose we formally introduced the local prop-erty of such an algorithm.</Paragraph>
    <Paragraph position="1"> We have described the efficient implementation of a finite-state toolkit that uses the principle of lazy algorithmic transducers for almost all algorithms. Among several publically available toolkits, the RWTH FSA toolkit presented here turned out to be the most efficient one, as several tests showed.</Paragraph>
    <Paragraph position="2"> Additionally, with lazy algorithmic transducers we have reduced the memory requirements and even increased the speed significantly compared to a non lazy implementation.</Paragraph>
    <Paragraph position="3"> We have also shown that a finite-state automata toolkit supports rapid solutions to problems from the field of natural language processing such as statistical machine translation. Despite the genericity of the methods, statistical machine translation can be done very efficiently.</Paragraph>
  </Section>
class="xml-element"></Paper>