<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3255"> <Title>Efficient Decoding for Statistical Machine Translation with a Fully Expanded WFST Model</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 IBM Model </SectionTitle>
<Paragraph position="0"> For our decoding research, we assume the IBM-style modeling for translation proposed in Brown et al. (1993). In this model, translation from Japanese $J$ to English $E$ is formulated as
$$\hat{E} = \operatorname*{argmax}_{E} P(E \mid J) = \operatorname*{argmax}_{E} P(E)\,P(J \mid E), \quad (1)$$
where $P(E)$ is referred to as a language model and $P(J \mid E)$ is referred to as a translation model. In this paper, we use a word trigram for the language model and IBM model 3 for the translation model.</Paragraph>
<Paragraph position="3"> The translation model is represented as follows, considering all possible word alignments $A$:
$$P(J \mid E) = \sum_{A} P(J, A \mid E).$$
The IBM model assumes only a one-to-many word alignment, where the Japanese word $f_j$ in the $j$-th position connects to the English word $e_{a_j}$ in the $a_j$-th position.</Paragraph>
<Paragraph position="6"> The IBM model 3 uses the following $P(J, A \mid E)$:
$$P(J, A \mid E) = \binom{m - \phi_0}{\phi_0}\, p_0^{\,m - 2\phi_0}\, p_1^{\,\phi_0} \prod_{i=1}^{l} \phi_i!\, n(\phi_i \mid e_i) \prod_{j=1}^{m} t(f_j \mid e_{a_j})\, d(j \mid a_j, l, m).$$
Here $\phi_i$ is the number of Japanese words connected to the English word $e_i$, and it is called fertility. Note, however, that $\phi_0$ is the number of words connecting to null words.</Paragraph>
<Paragraph position="10"> $n(\phi \mid e_i)$ is the conditional probability that English word $e_i$ connects to $\phi$ words in $J$; $n(\phi \mid e_i)$ is called the fertility probability. $t(f_j \mid e_i)$ is the conditional probability that English word $e_i$ is translated to Japanese word $f_j$, and it is called the translation probability.</Paragraph>
<Paragraph position="12"> $d(j \mid i, l, m)$ is the conditional probability that the English word in the $i$-th position connects to the Japanese word in the $j$-th position, on condition that the lengths of the English sentence $E$ and the Japanese sentence $J$ are $l$ and $m$, respectively. $d(j \mid i, l, m)$ is called the distortion probability. In our experiment, we used the IBM model 3 while assuming a constant distortion probability for simplicity.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 WFST Cascade Model </SectionTitle>
<Paragraph position="0"> A WFST is a finite-state device in which output symbols and weights are defined as well as input symbols. Using composition (Pereira and Riley, 1997), we can obtain the combined WFST $M_1 \circ M_2$ by connecting each output of $M_1$ to each input of $M_2$.</Paragraph>
<Paragraph position="2"> If we assume that each submodel of Equation (1) is represented by a WFST, a conventional decoder can be considered to compose the submodels dynamically.</Paragraph>
<Paragraph position="4"> The main idea of the proposed approach is to compute the composition beforehand.</Paragraph>
<Paragraph position="5"> Figure 1 shows the translation process modeled by a WFST cascade. This WFST cascade model (Knight and Al-Onaizan, 1998) represents the IBM model 3 described in the previous section. All possible permutations of the Japanese sentence are input to the cascade. First, the T model ($T$) translates each Japanese word to an English word. The NULL model ($N$) deletes the special word NULL. The Fertility model ($F$) merges identical consecutive words into one word. At each stage, the probability represented by the weight of a WFST is accumulated. Finally, the weight of the language model ($L$) is accumulated.</Paragraph>
<Paragraph position="6"> If a WFST $S$ represents all permutations of the input sentence, decoding can be considered a search for the best path of $S \circ T \circ N \circ F \circ L$. Therefore, computing $T \circ N \circ F \circ L$ in advance can improve the efficiency of the decoder.</Paragraph>
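To make the pre-composition idea concrete, the following is a minimal Python sketch of weighted transducer composition in the tropical semiring (weights as minus log probabilities, added along matching arcs). It assumes an epsilon-free, arc-list representation; the data layout and function names are illustrative assumptions, not the paper's implementation or any particular toolkit's API.

from collections import namedtuple

# Illustrative arc/WFST layout (an assumption, not the paper's data structures).
Arc = namedtuple("Arc", "src inp out weight dst")

class WFST:
    def __init__(self, start, finals, arcs):
        self.start = start          # start state
        self.finals = set(finals)   # set of final states
        self.arcs = list(arcs)      # list of Arc

def compose(a, b):
    """Epsilon-free composition: pair up states of a and b, matching a's
    output symbols against b's input symbols and adding -log weights."""
    out_a, in_b = {}, {}
    for arc in a.arcs:
        out_a.setdefault(arc.src, []).append(arc)
    for arc in b.arcs:
        in_b.setdefault(arc.src, []).append(arc)
    start = (a.start, b.start)
    arcs, finals, seen, stack = [], set(), {start}, [start]
    while stack:
        qa, qb = stack.pop()
        if qa in a.finals and qb in b.finals:
            finals.add((qa, qb))
        for x in out_a.get(qa, []):
            for y in in_b.get(qb, []):
                if x.out == y.inp:
                    dst = (x.dst, y.dst)
                    arcs.append(Arc((qa, qb), x.inp, y.out,
                                    x.weight + y.weight, dst))
                    if dst not in seen:
                        seen.add(dst)
                        stack.append(dst)
    return WFST(start, finals, arcs)

# Pre-computing the full-expansion model offline, in the spirit of the paper:
#   full_model = compose(compose(compose(T, N), F), L)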
<Paragraph position="7"> For $T$, $N$, and $F$, we adopt the representation of Knight and Al-Onaizan (1998). For $L$, we adopt the representation of Mohri et al. (2002). Figures 2-5 show examples of submodel representations with WFSTs. $\alpha(w)$ in Figure 5 stands for a back-off parameter. Conditional branches are represented by nondeterministic paths in the WFST.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Ambiguity Reduction </SectionTitle>
<Paragraph position="0"> If we can determinize a fully expanded WFST, we can achieve the best performance of the decoder.</Paragraph>
<Paragraph position="1"> [Figure 1: the WFST cascade translating the Japanese input "kaku tekisuto ha SGML de ko-do ka sareru" stage by stage into "each text is encoded in SGML".] However, the composed WFST for machine translation is not obviously determinizable. The word-to-word translation model $T$ strongly contributes to the WFST's ambiguity, while the $\epsilon$ transitions of the other submodels also contribute to ambiguity. Mohri et al. (2002) proposed a technique that adds special symbols to make a WFST determinizable. Determinization using this technique, however, is not expected to achieve efficient decoding in machine translation because the WFSTs of machine translation are inherently ambiguous.</Paragraph>
<Paragraph position="2"> To overcome this problem, we propose a novel WFST optimization approach that uses decoding information. First, our method merges WFST states by considering the statistics of hypotheses observed while decoding. After merging the states, redundant edges whose beginning states, end states, input symbols, and output symbols are all the same are also reduced. IBM models consider all possible alignments, while a decoder searches for only the most appropriate alignment. Therefore, there are many redundant states in the full-expansion WFST from the viewpoint of decoding.</Paragraph>
<Paragraph position="3"> We adopted a standard decoding algorithm from the speech recognition field, where the forward pass is a beam search and the backward pass is an A* search. Since a beam search is used in the forward pass, the obtained results are not optimal but suboptimal. All input permutations are represented by a finite-state acceptor (Figure 6), where each state corresponds to the set of input positions that have already been read (an illustrative construction of such an acceptor is sketched below). In the forward search, hypotheses are maintained for each state of the finite-state acceptor.</Paragraph>
<Paragraph position="4"> The WFST states that always appear together in the same hypothesis list of the forward beam search should be equated if the states contribute to correct translation. Let $M$ be a full-expansion WFST model and $R(J)$ be a WFST that represents the correct translation of an input sentence $J$. For each $J$, the states of $M$ that always appear together in the same hypothesis list in the course of decoding $J$ with $M \circ R(J)$ are merged in our method. Simply merging states of $M$ may increase model errors, but $R(J)$ corrects the errors caused by merging states. Unlike ordinary FSA minimization, states are merged without considering their successor states. If the weight represents a probability, the sum of the weights of outgoing transitions may not be 1.0 after merging states, and the probability condition may then be violated. Since the decoder does not sum over all possible paths but searches for the most appropriate paths, this kind of state merging does not pose a serious problem in practice.</Paragraph>
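The hypothesis lists above are indexed by the states of the permutation acceptor. Below is a minimal sketch of how such an acceptor can be built, assuming each state is the set of already-read input positions encoded as a bitmask; this construction is an illustrative assumption, not code from the paper.

def permutation_acceptor(words):
    """Return (start, finals, arcs); arcs are (src, label, dst) triples.
    Each state is a bitmask over input positions; each arc reads one
    not-yet-read position, so every accepting path is a permutation."""
    n = len(words)
    start, final = 0, (1 << n) - 1
    arcs = []
    for state in range(1 << n):
        for j in range(n):
            if not state & (1 << j):          # position j has not been read yet
                arcs.append((state, words[j], state | (1 << j)))
    return start, {final}, arcs

# Example: a 3-word input yields 2**3 = 8 states and 3! = 6 accepting paths.
# start, finals, arcs = permutation_acceptor(["kaku", "tekisuto", "ha"])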
<Paragraph position="7"> In the following experiment, we measured the association between two states by a $\phi$-like statistic, bounded between 0 and 1, that is computed from how often the two states appear in the same hypothesis lists during decoding. If the $\phi$ of two states is higher than a specified threshold, the two states are merged. Merging the beginning and end states of a transition whose input is $\epsilon$ (an $\epsilon$ transition for short) may cause a problem when decoding. In our implementation, a weight is basically a minus log probability, so its lower bound is 0 in theory. However, there exist negative $\epsilon$ transitions that originate from the back-off values of the n-gram model. If we merge the beginning and end states of a negative $\epsilon$ transition, the search process will not stop because of the negative $\epsilon$ loop. To avoid this problem, we round the negative weight to 0 if a negative $\epsilon$ loop appears during merging. In a preliminary experiment, a weight-pushing operation (Mohri and Riley, 2001) was also effective for deleting the negative $\epsilon$ transitions of our full-expansion models. However, pushing causes an imbalance of weights among paths if the WFST is not deterministic. As a result of this imbalance, we cannot compare path costs when pruning. In fact, our preliminary experiment showed that a pushed full-expansion WFST does not work well. Therefore, we adopted the simpler method of dealing with a negative $\epsilon$ loop described above.</Paragraph> </Section> </Paper>
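As a closing illustration of the state merging and negative-epsilon safeguard of Section 4, here is a small self-contained Python sketch. The $\phi$ definition below is the standard 2x2 phi (mean-square contingency) coefficient over co-occurrence counts and is an assumption, since the paper's exact formula is not reproduced in this excerpt; the arc layout, EPS label, and function names are likewise illustrative.

import math

EPS = "<eps>"  # assumed epsilon label

def phi_coefficient(n11, n10, n01, n00):
    """Assumed 2x2 phi coefficient from co-occurrence counts of two states:
    n11 = hypothesis lists containing both, n10/n01 = only one, n00 = neither."""
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denom == 0.0 else (n11 * n00 - n10 * n01) / denom

def merge_states(arcs, counts, threshold):
    """arcs: list of dicts {"src", "inp", "out", "weight", "dst"}.
    counts: {(p, q): (n11, n10, n01, n00)} gathered from the forward-beam
    hypothesis lists.  State q is redirected onto p when phi exceeds the
    threshold; redundant edges (same src, inp, out, dst) are reduced to the
    cheapest one, and a negative epsilon self-loop created by merging is
    rounded up to weight 0 so the search terminates."""
    alias = {}
    for (p, q), table in counts.items():
        if phi_coefficient(*table) > threshold:
            alias[q] = alias.get(p, p)
    def canon(s):
        while s in alias and alias[s] != s:
            s = alias[s]
        return s
    best = {}
    for a in arcs:
        src, dst = canon(a["src"]), canon(a["dst"])
        w = a["weight"]
        if src == dst and a["inp"] == EPS and w < 0.0:
            w = 0.0                              # avoid a non-terminating eps loop
        key = (src, a["inp"], a["out"], dst)
        if key not in best or w < best[key]:
            best[key] = w                        # keep the cheapest redundant edge
    return [{"src": s, "inp": i, "out": o, "weight": w, "dst": d}
            for (s, i, o, d), w in best.items()]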