<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1039"> <Title>A Decoder for Syntax-based Statistical MT</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Phrasal Translation </SectionTitle> <Paragraph position="0"> In (Yamada and Knight, 2001), the translation a22 is a 1-to-1 lexical translation from an English word a111 to a foreign word a112 , i.e., a106a68a4a5a22a104a8a25a7a6a113a76a114a106a92a4a97a112a115a8a111a82a6 . To allow non 1-to-1 translation, such as for idiomatic phrases or compound nouns, we extend the model as follows.</Paragraph> <Paragraph position="1"> First we use fertility a116 as used in IBM models to allow 1-to-N mapping.</Paragraph> <Paragraph position="3"> For N-to-N mapping, we allow direct translation a128 of an English phrase a111a129a78a49a111a19a81a98a84a68a84a68a84a49a111a68a130 to a foreign phrase a112a91a78a90a112a129a81a27a84a68a84a68a84a49a112a129a131 at non-terminal tree nodes as</Paragraph> <Paragraph position="5"> and linearly mix this phrasal translation with the word-to-word translation, i.e., if a17a19a18 is non-terminal. In practice, the phrase lengths (a192 ,a193 ) are limited to reduce the model size. In our experiment (Section 5), we restricted them as a194a129a84a60a194a92a193a196a195</Paragraph> <Paragraph position="7"> a194a129a84a201a200a83a193a203a202a14a200 , to avoid pairs of extremely different lengths. This formula was obtained by randomly sampling the length of translation pairs. See (Yamada, 2002) for details.</Paragraph> </Section> <Section position="5" start_page="0" end_page="2" type="metho"> <SectionTitle> 4 Decoding </SectionTitle> <Paragraph position="0"> Our statistical MT system is based on the noisy-channel model, so the decoder works in the reverse direction of the channel. Given a supposed channel output (e.g., a French or Chinese sentence), it will find the most plausible channel input (an English parse tree) based on the model parameters and the prior probability of the input.</Paragraph> <Paragraph position="1"> In the syntax-based model, the decoder's task is to find the most plausible English parse tree given an observed foreign sentence. Since the task is to build a tree structure from a string of words, we can use a mechanism similar to normal parsing, which builds an English parse tree from a string of English words.</Paragraph> <Paragraph position="2"> Here we need to build an English parse tree from a string of foreign (e.g., French or Chinese) words.</Paragraph> <Paragraph position="3"> To parse in such an exotic way, we start from an English context-free grammar obtained from the training corpus,2 and extend the grammar to in- null model. For each non-lexical rule in the original English grammar (such as &quot;VP a204 VB NP PP&quot;), we supplement it with reordered rules (e.g. &quot;VP a204 NP PP VB&quot;, &quot;VP a204 NP VB PP &quot;, etc.) and associate them with the original English order and the reordering probability from the r-table. Similarly, rules such as &quot;VP a204 VP X&quot; and &quot;X a204 word&quot; are added for extra word insertion, and they are associated with a probability from the n-table. For each lexical rule in the English grammar, we add rules such as &quot;englishWord a204 foreignWord&quot; with a probability from the t-table.</Paragraph> <Paragraph position="4"> Now we can parse a string of foreign words and build up a tree, which we call a decoded tree. An example is shown in Figure 2. 
<Paragraph position="4"> The decoded tree is built up in the foreign-language word order. To obtain a tree in the English order, we apply the reverse of the reorder operation (back-reordering), using the original-order information associated with each rule expanded from the r-table. In Figure 2, the numbers in the dashed oval near the top node show the original English order.</Paragraph>
<Paragraph position="5"> Then we obtain an English parse tree by removing the leaf nodes (foreign words) from the back-reordered tree. Among the possible decoded trees, we pick the tree for which the product of the LM probability (the prior probability of the English tree) and the TM probability (the probabilities associated with the rules in the decoded tree) is the highest.</Paragraph>
<Paragraph position="6"> The use of an LM needs some consideration. Theoretically we need an LM that gives the prior probability of an English parse tree. However, we can approximate it with an n-gram LM, which is well studied and widely implemented. We will discuss this point further in Section 7.</Paragraph>
<Paragraph position="7"> If we use a trigram model for the LM, a convenient implementation is to first build a decoded-tree forest and then pick out the best tree using a trigram-based forest-ranking algorithm as described in (Langkilde, 2000). The ranker uses the two leftmost and two rightmost leaf words of each subtree to efficiently calculate its trigram probability, and finds the most plausible tree according to the trigram and rule probabilities. This algorithm finds the optimal tree in terms of the model probability, but it is not practical when the vocabulary and the rule set grow large. The next section describes how to make it practical.</Paragraph> </Section>
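The boundary-word idea behind this forest ranking can be sketched as follows. This is a simplified illustration (the Hyp structure and the stand-in trigram scorer are hypothetical), not Langkilde's actual algorithm: when two adjacent subtrees are combined, only the trigrams that straddle the seam are new, so each hypothesis needs to carry just its two leftmost and two rightmost leaf words.

```python
import math
from collections import namedtuple

# A hypothesis for a subtree: total log-probability of the trigrams that lie
# fully inside the subtree, plus its boundary words (up to two on each side).
Hyp = namedtuple("Hyp", ["logprob", "left", "right"])

def trigram_logprob(w1, w2, w3):
    """Stand-in trigram model; a real system would query a trained LM here."""
    return math.log(1e-3)

def combine(a, b):
    """Combine two adjacent subtree hypotheses. Only trigrams spanning the
    seam are new; everything inside a or b has already been scored."""
    seam = a.right + b.left              # at most four words around the seam
    lp = a.logprob + b.logprob
    for i in range(len(seam) - 2):
        lp += trigram_logprob(*seam[i:i + 3])
    left = (a.left + b.left)[:2]         # boundary words of the combined span
    right = (a.right + b.right)[-2:]
    return Hyp(lp, left, right)

x = Hyp(0.0, ("the", "official"), ("the", "official"))    # subtree yielding "the official"
y = Hyp(0.0, ("visited", "today"), ("visited", "today"))  # subtree yielding "visited today"
print(combine(x, y))
```

Because every hypothesis is summarized by its boundary words, hypotheses over the same span with the same boundaries can be merged, which is what makes ranking an entire forest feasible.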
<Section position="6" start_page="2" end_page="2" type="metho"> <SectionTitle> 5 Pruning </SectionTitle>
<Paragraph position="0"> We use our decoder for Chinese-English translation in a general news domain. The TM becomes very large for such a domain. In our experiment (see Section 6 for details), there are about 4M non-zero entries in the trained t(f|e) table. About 10K CFG rules are used in the parsed English corpus, which results in about 120K non-lexical rules for the decoding grammar (after we expand the CFG rules as described in Section 4). We first applied the simple algorithm from Section 4, but this experiment failed: no complete translations were produced, and even four-word sentences could not be decoded. This is not only because the model size is huge, but also because the decoder considers multiple syntactic structures for the same word alignment, i.e., there are several different decoded trees even when the translation of the sentence is the same. We then applied the following measures to achieve practical decoding. The basic idea is to use additional statistics from the training corpus.</Paragraph>
<Paragraph position="1"> beam search: We give up optimal decoding by using a standard dynamic-programming parser with beam search, similar to the parser used in (Collins, 1999). A standard dynamic-programming parser builds &lt;nonterminal, input-substring&gt; tuples bottom-up according to the grammar rules. When the parsing cost comes only from features within a subtree (the TM cost, in our case), the parser finds the optimal tree by keeping the single best subtree for each tuple. When the cost also depends on features outside a subtree, we would need to keep all subtrees for each possible configuration of outside features (the boundary words, for the trigram LM cost) to obtain the optimal tree. Instead of keeping all such subtrees, we only retain the subtrees within a beam width for each input substring. Since the outside features are not considered in the beam pruning, the optimality of the parse is no longer guaranteed, but the required memory is greatly reduced.</Paragraph>
<Paragraph position="2"> t-table pruning: Given a foreign (Chinese) sentence, the decoder only considers English words e for each foreign word f such that P(e|f) is high. In addition, only a limited set of part-of-speech labels t is considered, to reduce the number of possible decoded-tree structures. Thus, for each f, we only use the top-5 (e, t) pairs, ranked by the translation probability combined with the part-of-speech statistics P(t) and P(e|t) collected from the parsed training corpus.</Paragraph>
<Paragraph position="3"> phrase pruning: Phrasal translation pairs are restricted to those observed in the Viterbi alignments of the training corpus (see Section 2), and a pair must appear more than once in those alignments. We then use the top-10 pairs, ranked similarly to the t-table pruning above, except that P(e|f) is replaced by the corresponding phrase translation probability. By this pruning, we effectively remove junk phrase pairs, most of which come from misaligned sentences or untranslated phrases in the training corpus.</Paragraph>
<Paragraph position="4"> r-table pruning: To reduce the number of rules in the decoding grammar, we use the top-N rules ranked by P(rule)P(reord), keeping as many rules as needed to cover most of the probability mass. Here P(rule) is a prior probability of the rule (in the original English order) found in the parsed English corpus, and P(reord) is the reordering probability in the TM. The product is a rough estimate of how likely a rule is to be used in decoding. Because only a limited number of reorderings occur in actual translation, a small number of rules are highly probable. In fact, among a total of 138,662 reorder-expanded rules, the most likely 875 rules contribute 95% of the probability mass, so discarding the rules that contribute the lower 5% of the probability mass eliminates more than 99% of the total rules.</Paragraph>
<Paragraph position="5"> zero-fertility words: An English word may be translated into a null (zero-length) foreign word. This happens when the fertility of the English word e can be zero, and such an e (called a zero-fertility word) must be inserted during decoding. The decoding parser is modified to allow inserting zero-fertility words, but unlimited insertion easily blows up the memory, so only limited insertion is allowed. Observing the Viterbi alignments of the training corpus, the top-20 most frequent zero-fertility words cover over 70% of the cases, so only those are allowed to be inserted. We also use syntactic context to limit the insertion. For example, the zero-fertility word "in" is inserted as IN when the rule "PP -> IN NP-A" is applied. Again, observing the Viterbi alignments, the top-20 most frequent contexts cover over 60% of the cases, so we allow insertions only in these contexts. This kind of context-sensitive insertion is possible because the decoder builds a syntactic tree; such selective insertion by syntactic context is not easy for word-based decoders such as those for the IBM models.</Paragraph>
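As a concrete illustration of the r-table pruning described above, the following minimal Python sketch (hypothetical function and argument names, not the authors' code) keeps the most probable reorder-expanded rules until their cumulative share of the P(rule)P(reord) mass reaches the chosen level (95% in the experiments reported above):

```python
def prune_rules(scored_rules, mass=0.95):
    """scored_rules: list of (rule, p_rule * p_reord) pairs, where p_rule is the
    prior probability of the original-order rule and p_reord the reordering
    probability from the r-table. Keep the most likely rules until their
    cumulative share of the total score reaches `mass`."""
    total = sum(score for _, score in scored_rules)
    kept, running = [], 0.0
    for rule, score in sorted(scored_rules, key=lambda x: x[1], reverse=True):
        if running >= mass * total:
            break
        kept.append(rule)
        running += score
    return kept

# Toy example: the rarest reordering falls below the 95% cutoff and is dropped.
rules = [("VP -> NP PP VB", 0.40), ("VP -> VB NP PP", 0.35),
         ("VP -> NP VB PP", 0.20), ("VP -> PP NP VB", 0.05)]
print(prune_rules(rules, mass=0.95))
```

With the statistics reported above (138,662 expanded rules), such a 95% cutoff keeps on the order of 875 rules.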
<Paragraph position="6"> The pruning techniques above use extra statistics from the training corpus, such as P(t), P(e|t), and P(rule). These statistics may be considered part of the LM (the prior probability of an English parse tree), and such syntactic probabilities are essential when we mainly use trigrams for the LM. In this respect, the pruning is useful not only for reducing the search space but also for improving the quality of the translation. We also use statistics from the Viterbi alignments, such as the phrase translation frequency and the zero-fertility context frequency. These are statistics that are not modeled in the TM: the frequency count is essentially a joint probability P(f, e), while the TM uses a conditional probability P(f|e). Utilizing statistics from outside a model is an important idea for statistical machine translation in general. For example, the decoder in (Och and Ney, 2000) uses alignment template statistics found in the Viterbi alignments.</Paragraph> </Section>
<Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> 6 Experimental Results: Chinese/English </SectionTitle>
<Paragraph position="0"> This section describes results obtained with the decoder as described in the previous section. We used a Chinese-English translation corpus. After discarding long sentences (more than 20 words in English), the English side of the corpus consisted of about 3M words, and it was parsed with Collins' parser (Collins, 1999). Training the TM took about 8 hours on a 54-node Unix cluster. We selected 347 short sentences (fewer than 14 words in the reference English translation) from the held-out portion of the corpus for evaluation.</Paragraph>
<Paragraph position="1"> Table 1 shows the decoding performance on the test sentences. The first system, ibm4, is a reference system based on IBM Model 4. The second and third systems (syn and syn-nozf) are our decoders. Both used the same decoding algorithm and pruning described in the previous sections, except that syn-nozf allowed no zero-fertility insertions. The average decoding speed was about 100 seconds per sentence for both syn and syn-nozf.</Paragraph>
<Paragraph position="2"> As an overall decoding performance measure, we used the BLEU metric (Papineni et al., 2002). This measure is a geometric average of the n-gram accuracies, adjusted by a length penalty factor LP computed from the system output length and the reference length. The n-gram accuracy (in percent) is shown in Table 1 as P1/P2/P3/P4 for unigram/bigram/trigram/4-gram.</Paragraph>
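For reference, BLEU combines modified n-gram precisions with a brevity penalty roughly as in the following generic sketch (the standard single-reference formulation from (Papineni et al., 2002), not the exact evaluation script used in these experiments):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of modified n-gram precisions times a brevity penalty.
    Assumes a non-empty candidate and a single reference."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    lp = 1.0 if c > r else math.exp(1 - r / c)   # length (brevity) penalty
    return lp * geo_mean

print(bleu("the official visited beijing today".split(),
           "the official visited beijing yesterday".split()))
```

The P1/P2/P3/P4 columns in Table 1 correspond to the four precision terms, and LP is the penalty factor applied to their geometric mean.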
<Paragraph position="3"> Overall, our decoder performed better than the IBM system, as indicated by the higher BLEU score. We obtained better n-gram accuracy, but the lower LP score penalized the overall score. Interestingly, the system with no explicit zero-fertility word insertion (syn-nozf) performed better than the one with zero-fertility insertion (syn). It seems that most zero-fertility words were already included in the phrasal translations, and the explicit zero-fertility word insertion produced more garbage words than expected.</Paragraph>
<Paragraph position="4"> To verify that the pruning was effective, we relaxed the pruning thresholds and checked the decoding coverage on the first 92 sentences of the test data. Table 2 shows the results. On the left, the r-table pruning was relaxed from the 95% level to 98% or 100%. On the right, the t-table pruning was relaxed from the top-5 (e, t) pairs to the top-10 or top-20 pairs. The systems r95 and w5 are identical to syn-nozf in Table 1. When the r-table pruning was relaxed from 95% to 98%, only about half (47/92) of the test sentences were decoded; the others were aborted due to lack of memory. When it was further relaxed to 100% (i.e., no pruning at all), only 20 sentences were decoded. Similarly, when the t-table pruning threshold was relaxed, fewer sentences could be decoded because of the memory limitations.</Paragraph>
<Paragraph position="5"> Although our decoder performed better than the IBM system in the BLEU score, the obtained gain was smaller than we expected. We see three possible reasons. First, the syntax of Chinese is not extremely different from that of English, compared with languages such as Japanese or Arabic, so the TM could not take full advantage of its syntactic reordering operations. Second, our decoder searches for a decoded tree, not just a decoded sentence; the search space is therefore larger than for the IBM models, which might lead to more search errors caused by pruning. Third, the LM used by our system was exactly the same as the LM used by the IBM system, and decoding performance may be heavily influenced by LM performance. In addition, since the TM assumes an English parse tree as input, a trigram LM might not be appropriate. We will discuss this point in the next section.</Paragraph>
<Paragraph position="6"> Phrasal translation worked quite well. Figure 3 shows the top-20 most frequent phrase translations observed in the Viterbi alignments; the leftmost column shows how many times each appeared. Most of them are correct. The model even detected frequent sentence-to-sentence translations, since we only imposed a relative length limit on phrasal translations (Section 3). However, some of them, such as the pair with (in cantonese), are wrong. We expected that such junk phrases would be eliminated by the phrase pruning (Section 5); however, junk phrases that appear many times in the corpus were not effectively filtered out.</Paragraph> </Section>
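The junk-phrase problem above follows from the fact that the phrase pruning of Section 5 keeps any pair that is sufficiently frequent in the Viterbi alignments. A minimal Python sketch of such a filter (hypothetical names; it ranks by raw frequency, whereas the actual pruning ranks by translation probability) makes the failure mode visible: a wrong pair that recurs across many misaligned sentences passes the filter just like a correct one.

```python
from collections import Counter, defaultdict

def filter_phrase_pairs(viterbi_phrase_pairs, top_k=10):
    """viterbi_phrase_pairs: iterable of (english_phrase, foreign_phrase) tuples
    read off the Viterbi alignments of the training corpus.
    Keep pairs seen more than once, then the top_k most frequent English
    phrases for each foreign phrase."""
    counts = Counter(viterbi_phrase_pairs)
    by_foreign = defaultdict(list)
    for (en, fo), c in counts.items():
        if c > 1:                      # must appear more than once
            by_foreign[fo].append((en, c))
    kept = []
    for fo, candidates in by_foreign.items():
        candidates.sort(key=lambda x: x[1], reverse=True)
        # A frequent but wrong pair (e.g. "(in cantonese)" aligned to an
        # unrelated phrase) survives here exactly like a correct pair.
        kept.extend((en, fo) for en, _ in candidates[:top_k])
    return kept
```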
<Section position="8" start_page="2" end_page="2" type="metho"> <SectionTitle> 7 Decoded Trees </SectionTitle>
<Paragraph position="0"> The BLEU score measures the quality of the decoder's output sentences. We were also interested in the syntactic structure of the decoded trees. The leftmost tree in Figure 4 is a decoded tree from the syn-nozf system. Surprisingly, even though the decoded sentence is passable English, the tree structure is totally unnatural. We had assumed that a good parse tree gives high trigram probabilities, but it seems that a bad parse tree may give good trigram probabilities too. We also noticed that too many unary rules (e.g., "NPB -> PRN") were used; this is because the reordering probability for a unary rule is always 1.</Paragraph>
<Paragraph position="1"> To remedy this, we added CFG probabilities (PCFG) to the decoder search, i.e., it now looks for a tree that maximizes P(trigram)P(cfg)P(TM). The CFG probability was obtained by counting rule frequencies on the parsed English side of the training corpus. The middle tree in Figure 4 is the output for the same sentence. The syntactic structure now looks better, but we found three problems. First, the BLEU score is worse (0.078). Second, the decoded trees seem to prefer noun phrases: in many trees, an entire sentence was decoded as one large noun phrase. Third, the decoder reorders nodes more often than it should.</Paragraph>
<Paragraph position="2"> The BLEU score may have gone down because we weighted the LM (trigram and PCFG) more heavily than the TM. As for the preference for noun phrases, we suspected a problem with the corpus: our training corpus contains many dictionary entries, and the parliament transcripts also include lists of participants' names, which may cause the LM to prefer noun phrases too much. Our corpus also contains two types of noise: sentence alignment errors and English parse errors. The corpus was sentence-aligned by automatic software, so it contains some bad alignments. When a sentence is misaligned, or the parse is wrong, the Viterbi alignment becomes an over-reordered tree, since it picks up plausible translation word pairs first and reorders the tree to fit them.</Paragraph>
<Paragraph position="3"> To see whether this was really a corpus problem, we selected a good portion of the corpus and re-trained the r-table. To find good sentence pairs, we used the following criteria: 1) both the English and the Chinese sentence end with a period; 2) the English sentence begins with a capitalized word; 3) the sentences contain no symbol characters, such as colons or dashes, which tend to cause parse errors; and 4) the Viterbi-ratio (the ratio of the probability of the most plausible alignment to the sum of the probabilities of all alignments; a low Viterbi-ratio is a good indicator of misalignment or parse error) is higher than the average over the pairs that satisfy the first three conditions.</Paragraph>
<Paragraph position="4"> Using the selected sentence pairs, we retrained only the r-table and the PCFG. The rightmost tree in Figure 4 is the decoded tree using the re-trained TM. The BLEU score improved (0.085), and the tree structure looks better, though there are still problems. An obvious remaining problem is that the goodness of a syntactic structure depends on the lexical choices: for example, the best syntactic structure differs depending on whether or not a verb requires a noun phrase as its object. The PCFG-based LM does not handle this.</Paragraph>
<Paragraph position="5"> Figure 4 (caption): PCFG added to the search (middle); r-table re-trained and PCFG used (right). Each tree was back-reordered and is shown in the English order.</Paragraph>
<Paragraph position="6"> At this point, we gave up using the PCFG as a component of the LM; using only trigrams gives the best BLEU score. However, the BLEU metric may not reflect the syntactic aspect of translation quality, and, as we saw in Figure 4, we can improve the syntactic quality by introducing the PCFG together with some corpus selection. Also, the pruning methods described in Section 5 already use syntactic statistics from the training corpus. We are therefore now investigating more sophisticated LMs, such as (Charniak, 2001), which incorporate syntactic features and lexical information.</Paragraph> </Section> </Paper>
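The corpus-selection heuristics above can be sketched as a simple filter. This is a hypothetical rendering (the field names, the symbol list, and the use of the Chinese full stop are assumptions), assuming the Viterbi-ratio has already been computed for each sentence pair during training:

```python
def select_good_pairs(pairs):
    """pairs: iterable of (english_sentence, chinese_sentence, viterbi_ratio).
    Returns the pairs satisfying the four selection criteria described above."""
    def basic_ok(en, zh):
        return (en.rstrip().endswith(".")              # 1) English ends with a period
                and zh.rstrip().endswith("\u3002")     # 1) Chinese ends with a full stop
                and en[:1].isupper()                   # 2) English starts with a capital
                and not any(c in en for c in ":;-()")) # 3) no symbols that break the parser
    candidates = [(en, zh, vr) for en, zh, vr in pairs if basic_ok(en, zh)]
    if not candidates:
        return []
    # 4) Viterbi-ratio above the average of the pairs passing the first three tests
    avg_vr = sum(vr for _, _, vr in candidates) / len(candidates)
    return [(en, zh) for en, zh, vr in candidates if vr > avg_vr]
```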