File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/w03-1001_metho.xml
Size: 22,227 bytes
Last Modified: 2025-10-06 14:08:25
<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1001"> <Title>A Projection Extension Algorithm for Statistical Machine Translation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Block Generation Algorithm </SectionTitle> <Paragraph position="0"> The starting point for the block generation algorithm is a word alignment obtained from HMM Viterbi training (Vogel et al., 1996). The HMM Viterbi training is carried out twice, once with English as target language and Chinese as source language and once vice versa. We obtain two alignment relations:</Paragraph> <Paragraph position="2"> $A_1 = \{(j, a_j) \mid a_j > 0\}$ and $A_2 = \{(b_i, i) \mid b_i > 0\}$, where $j \rightarrow a_j$ is an alignment function from source to target positions and $i \rightarrow b_i$ is an alignment function from target to source positions. We compute the union and the intersection of the two alignment relations $A_1$ and $A_2$:</Paragraph> <Paragraph position="4"> $I = A_1 \cap A_2$ and $U = A_1 \cup A_2$. We call the intersection relation $I$, because it represents a high-precision alignment, and the union alignment $U$, because it is taken to be a lower-precision, higher-recall alignment (Och and Ney, 2000).</Paragraph> <Paragraph position="5"> The intersection $I$ is also a (partial) bijection between the target and source positions: it covers the same number of target and source positions, and there is a bijection between the source and target positions that are covered. For the CE experiments reported in Section 4 about a109 a70 % of the target and source positions are covered by word links in $I$; for the AE experiments about a110 a70 % are covered. The extension algorithm presented here assumes that $I \subseteq U$, which is valid in this case since $I$ and $U$ are derived from intersection and union. We introduce the following additional piece of notation:</Paragraph> <Paragraph position="7"> $\mathrm{cov}(I) = \{\, j \mid (j, i) \in I \text{ for some } i \,\}$ is the set of all source positions that are covered by some word link in $I$, where the source positions are shown along the $j$-axis and the target positions are shown along the $i$-axis. To derive high-precision block links from the high-precision word links, we use the following projection definitions:</Paragraph> <Paragraph position="9"> $\mathrm{pro}_T([j', j]) = [\, \min \{ i \mid (j'', i) \in I,\ j'' \in [j', j] \},\ \max \{ i \mid (j'', i) \in I,\ j'' \in [j', j] \} \,]$. Here, $\mathrm{pro}_T(\cdot)$ projects source intervals into target intervals; $\mathrm{pro}_S(\cdot)$ projects target intervals into source intervals and is defined accordingly. Starting from the high-precision word alignment $I$, we try to derive a high-precision block alignment: we project source intervals $[j', j]$, where $j', j \in \mathrm{cov}(I)$. We compute the minimum target index $i'$ and maximum target index $i$ for the word links in $I$ that fall into the interval $[j', j]$.

(Figure 2 caption fragment: the left picture shows blocks that are learned from projecting three source intervals; the right picture shows three blocks that cannot be obtained from source interval projections.)</Paragraph> <Paragraph position="11"> This way, we obtain a mapping of source intervals into target intervals:</Paragraph> <Paragraph position="13"> $[j', j] \rightarrow \mathrm{pro}_T([j', j])$ (3). The approach is illustrated in Figure 2, where in the left picture, for example, a source interval is projected into its target interval. The pair</Paragraph> <Paragraph position="15"> $([j', j],\ \mathrm{pro}_T([j', j]))$ defines a block alignment link $b$. We use this notation to emphasize that the identity of the words is not used in the block learning algorithm.
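To make the projection step concrete, the following minimal Python sketch derives high-precision block links by projecting source intervals with the intersection alignment $I$, as described above. The function names (project_target, high_precision_block_links) and the toy alignment are illustrative and not taken from the paper; $I$ is assumed to be given as a set of (source position, target position) pairs.

```python
# Minimal sketch of high-precision block link generation by source-interval
# projection (Section 2).  I is assumed to be a set of (j, i) pairs, i.e.,
# (source position, target position), forming a partial bijection.

def project_target(I, j_lo, j_hi):
    """Project the source interval [j_lo, j_hi] onto the target side:
    return the minimal target interval covering all links of I whose
    source position falls into [j_lo, j_hi], or None if there is none."""
    targets = [i for (j, i) in I if j_lo <= j <= j_hi]
    if not targets:
        return None
    return (min(targets), max(targets))

def high_precision_block_links(I):
    """Enumerate all source intervals whose endpoints are covered by I and
    pair each with its projected target interval (the mapping of Eq. 3)."""
    covered = sorted({j for (j, _) in I})   # source positions covered by I
    links = set()
    for a in range(len(covered)):
        for b in range(a, len(covered)):
            j_lo, j_hi = covered[a], covered[b]
            links.add(((j_lo, j_hi), project_target(I, j_lo, j_hi)))
    return links

# Toy example: a partial bijection over a short sentence pair.
I = {(1, 1), (2, 3), (4, 4)}
for src, tgt in sorted(high_precision_block_links(I)):
    print(src, "->", tgt)
```

As stated above, the order in which the source intervals are enumerated does not affect the resulting link set.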
To denote the block consisting of the target and source words at the link positions, we write $b = (e_{i'} \cdots e_i\,;\, f_{j'} \cdots f_j)$, where the $e$'s denote target words and the $f$'s denote source words, and we use a function that maps intervals to the words at these intervals.

(Figure 3 caption fragment: word links on the frontier of a block; additional word links may be inside the block.)</Paragraph> <Paragraph position="16"> The algorithm for generating the high-precision block alignment links is given in Table 1. The order in which the source intervals are generated does not change the final link set.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Block Extension Algorithm </SectionTitle> <Paragraph position="0"> Empirically, we find that expanding the high-precision block links significantly improves performance. The expansion is parameterised and described below. For a block link $b = ([j', j], [i', i])$, we compute its frontier by looking at all word links that lie on one of the four boundary lines of the block. We make the following observation, as shown in Figure 3: the number of links (filled dots in the picture) on the frontier is less than or equal to $4$, since in every column and row there is at most one link in $I$, which is a partial bijection. To learn blocks from a general word alignment that is not a bijection, more than $4$ word links may lie on the frontier of a block, but to compute all possible blocks it is sufficient to look at all possible quadruples of word links. We extend the links on the frontier by links of the high-recall alignment $U$, where we use a parameterised way of locally extending a given word link. We compute an extended link set $E$ by extending each word link on the frontier separately and taking the union of the resulting links. The way a word link is extended is illustrated in Figure 4. The filled dot in the center of the picture is an element of the high-precision set $I$. Starting from this link, we look for extensions in its neighborhood that lie in $U$, where the neighborhood is defined by a cell width parameter $w$ and a distance parameter $d$. For instance, link $l_1$ in Figure 4 is reached with cell width a227 and distance a227, link $l_2$ is reached with cell width a227 and distance a232, and link $l_3$ is reached with cell width a232 and distance a236. The word link $l$ is added to $E$ and is itself extended using the same scheme. Here, we never make use of a row or a column covered by $I$ other than the rows $i$ and $i'$ and the columns $j$ and $j'$. Also, we do not cross such a row or column using an extension with $d \ge$ a232: this way only a small fraction of the word links in $U$ is used for extending a single block link. The extensions are carried out iteratively until no new alignment links from $U$ are added to $E$.
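The exact geometry of the extension neighborhood is only partly recoverable from the text, so the following Python sketch is one plausible reading rather than the paper's definition: starting from the frontier links of the high-precision set, it iteratively adds links of the high-recall set $U$ that fall within a window controlled by a cell-width parameter w and a distance parameter d, until no new links are added. The helper names and the rectangular window are assumptions; the restriction that extensions never cross a row or column covered by $I$ (other than the block's own boundary rows and columns) is omitted for brevity.

```python
# Hedged sketch of the iterative word-link extension (Section 2.1).
# frontier: links of I on the block boundary; U: the high-recall link set.
# A candidate U-link is accepted if it lies within w cells of an already
# accepted link along one axis and within d cells along the other -- an
# assumption about the neighborhood shape, not the paper's definition.

def in_neighborhood(seed, cand, w, d):
    dj = abs(cand[0] - seed[0])
    di = abs(cand[1] - seed[1])
    return (dj <= w and di <= d) or (dj <= d and di <= w)

def extend_links(frontier, U, w, d):
    """Grow an extension set E from the frontier links, adding U-links that
    fall into the neighborhood of an already accepted link, until closure."""
    E = set(frontier)
    changed = True
    while changed:
        changed = False
        for cand in U:
            if cand in E:
                continue
            if any(in_neighborhood(seed, cand, w, d) for seed in E):
                E.add(cand)
                changed = True
    return E
```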
The block extension algorithm in Table 2 uses the extension set $E$ to generate all word link quadruples; the $\min$ and $\max$ functions compute the minimum and the maximum of $4$ integer values.</Paragraph> <Paragraph position="1"> (Table 2 fragment: input is a block link $b = ([j', j], [i', i])$; if a candidate block $b''$ generated from a word link quadruple includes $b$, it is added to the output; output is the extended block link set.)</Paragraph> <Paragraph position="2"> From each word link quadruple a candidate block link $b''$ is generated, and a check is carried out whether $b''$ includes the seed block link $b$. The following definition for block link inclusion is used:</Paragraph> <Paragraph position="4"> $b' \sqsubseteq b$ iff $[j_1', j_1] \subseteq [j', j]$ and $[i_1', i_1] \subseteq [i', i]$, where the block $b' = ([j_1', j_1], [i_1', i_1])$ is said to be included in $b = ([j', j], [i', i])$; here $[i_1', i_1] \subseteq [i', i]$ holds iff $i_1' \ge i'$ and $i_1 \le i$, and analogously for the source intervals. The 'seed' block link $b$ is extended 'outwardly': all extended blocks $b''$ include the high-precision block $b$. The block link $b$ may itself be included in other high-precision block links $b'$, but $b \sqsubseteq b'' \sqsubseteq b'$ holds. An extended block $b''$ derived from the block $b$ never violates the projection restriction relative to $I$, i.e., we do not have to re-check the projection restriction for any generated block, which simplifies and speeds up the generation algorithm. The approach is illustrated in Figure 5, where a high-precision block with a236 elements on its frontier is extended by two blocks containing it.</Paragraph> <Paragraph position="5"> The block link extension algorithm produces block links that contain new source and target intervals $[j', j]$ and $[i', i]$ that extend the interval mapping in Eq. 3. This mapping is no longer a function, but rather a relation between source and target intervals, i.e., a single source interval is mapped to several target intervals and vice versa. The extended block set constitutes a subset of the following set of interval pairs: $\{\, ([j', j], [i', i]) \mid \mathrm{pro}_T([j', j]) \subseteq [i', i] \,\}$. The set of high-precision blocks is contained in this set. We cannot use the entire set of blocks defined by all pairs in the above relation: the resulting set of blocks cannot be handled due to memory restrictions, which motivates our extension algorithm. We also tried the following symmetric restriction and tested the resulting block set: $\mathrm{pro}_T([j', j]) \subseteq [i', i]$ and $\mathrm{pro}_S([i', i]) \subseteq [j', j]$ (4). The modified restriction is implemented in the context of the extension scheme in Table 1 by inserting an if-statement before the alignment link $l$ is extended: the alignment link is extended only if the restriction $\mathrm{pro}_S([i', i]) \subseteq [j', j]$ also holds.
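A small Python sketch of the quadruple generation and inclusion check just described, under the assumption that a block link is represented as a pair of closed intervals ((j_lo, j_hi), (i_lo, i_hi)); the helper names are illustrative, not the paper's. Quadruples are drawn with repetition so that candidate blocks spanned by fewer than four distinct extension links are also generated.

```python
from itertools import combinations_with_replacement

# Block links are ((j_lo, j_hi), (i_lo, i_hi)): a source and a target interval.

def includes(outer, inner):
    """Return True iff block link 'inner' is included in block link 'outer',
    i.e. both of its intervals are sub-intervals of the corresponding ones."""
    (oj, oi), (ij_, ii) = outer, inner
    return (oj[0] <= ij_[0] and ij_[1] <= oj[1] and
            oi[0] <= ii[0] and ii[1] <= oi[1])

def extended_block_links(seed, E):
    """Generate candidate extended blocks from quadruples of links in the
    extension set E and keep those that include the seed block (Table 2 style)."""
    out = set()
    for quad in combinations_with_replacement(E, 4):
        js = [j for (j, _) in quad]
        is_ = [i for (_, i) in quad]
        cand = ((min(js), max(js)), (min(is_), max(is_)))
        if includes(cand, seed):
            out.add(cand)
    return out

# Example: extend a seed block using a few extension links (j, i).
seed = ((2, 3), (2, 3))
E = {(1, 2), (2, 2), (3, 3), (4, 3), (3, 4)}
print(sorted(extended_block_links(seed, E)))
```

Because every candidate is kept only if it includes the seed block, the extension is 'outward' in the sense described above.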
Considering only block links for which the two-way projection in Eq. 4 holds has the following interesting interpretation: assuming a bijection $I$ that is complete, i.e., all source and target positions are covered, an efficient block segmentation algorithm exists to compute a Viterbi block alignment as in Figure 1 for a given training sentence pair. The complexity of the algorithm is quadratic in the length of the source sentence. This dynamic programming technique is not used in the current block selection but might be used in future work.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Unigram Block Selection </SectionTitle> <Paragraph position="0"> For selecting blocks from the candidate block links, we restrict ourselves to block links where the target and source phrases are at most a74 words long.</Paragraph> <Paragraph position="1"> This way we obtain some tens of millions of blocks on our training data, including blocks that occur only once. This baseline set is further filtered using the block unigram count: for the Chinese-English experiments, we use the a75a79a82 restriction as our baseline, and for the Arabic-English experiments the a75a79a83 restriction. Blocks where the target and the source clump are of length a84 are kept regardless of their count; the block unigram probability is computed as the relative frequency over all selected blocks.</Paragraph> <Paragraph position="2"> An example of a91 blocks obtained from the Chinese-English training data is shown in Figure 6. '$DATE' is a placeholder for a date expression. Block a76 a99 contains the blocks a76 a92 to a76 a97 . All a91 blocks are selected in training: the unigram decoder prefers a76a77a99 even if a76 a92 , a76 a94 , and a76 a97 are much more frequent. The solid word links are word links in $I$, the striped word links are word links in $U$. Using the links in $U$, we can learn one-to-many block translations, e.g. the pair (a101 a92 , 'Xinhua news agency') is learned from the training data.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 DP-based Decoder </SectionTitle> <Paragraph position="0"> We use a DP-based beam search procedure similar to the one presented in (Tillmann and Ney, 2003).</Paragraph> <Paragraph position="1"> We maximize over all block segmentations $b_1^n$ for which the source phrases yield a segmentation of the input source sentence, generating the target sentence simultaneously. The decoder processes search states of the following form: $(e'', e';\, C, j;\, l, l')$, where $e''$ and $e'$ are the two predecessor words used for the trigram language model, $C$ is the so-called coverage vector that keeps track of the already processed source positions, and $j$ is the last processed source position. $l$ is the source phrase length of the block currently being matched, and $l'$ is the length of the initial fragment of the source phrase that has been processed so far; $l'$ is less than or equal to $l$: $l' \le l$. Note that partial hypotheses are not distinguished according to the identity of the block itself. The decoder processes the input sentence 'cardinality-synchronously': all partial hypotheses that are active at a given point cover the same number of input sentence words. The same beam-search pruning as described in (Tillmann and Ney, 2003) is used. The so-called observation pruning threshold is modified as follows: for each source interval that is being matched by a block source phrase, at most the best a127a141a128 target phrases according to the joint unigram probability are hypothesized.
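The following sketch illustrates the decoder bookkeeping described in this paragraph: a search state carrying the trigram history, the coverage vector, the last processed source position, and the partially matched block source phrase, together with the observation-pruning step that keeps only the best target phrases per matched source interval by block unigram probability. The field names and the top_n value are illustrative assumptions; the paper does not give implementation details, and its actual pruning threshold is not recoverable here.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SearchState:
    """Decoder search state (Section 3): two predecessor target words for the
    trigram LM, a coverage vector over source positions, the last processed
    source position, the source-phrase length l of the block being matched,
    and the length l_done <= l of its already processed initial fragment."""
    history: Tuple[str, str]
    coverage: Tuple[bool, ...]
    last_pos: int
    l: int
    l_done: int

def observation_prune(candidates, unigram_prob, top_n=10):
    """Observation pruning: for a matched source interval, hypothesize at most
    the top_n target phrases ranked by joint block unigram probability.
    top_n is an illustrative value, not the threshold used in the paper."""
    return sorted(candidates, key=unigram_prob, reverse=True)[:top_n]
```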
The list of blocks that correspond to a matched source interval is stored in a chart for each input sentence. This way the matching is carried out only once for all partial hypotheses that try to match the same input sentence interval.</Paragraph> <Paragraph position="2"> In the current experiments, decoding without block re-ordering yields the best translation results. The decoder translates about a127a141a142a96a143 words per second.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Chinese-English Experiments </SectionTitle> <Paragraph position="0"> The translation system is tested on a Chinese-to-English translation task. For testing, we use the DARPA/NIST MT 2001 dry-run testing data, which consists of a144a96a145a7a146 sentences with a128a96a143 a146a7a146a7a146 words arranged in a142a96a143 documents. The training data is provided by the LDC and labeled by NIST as the Large Data condition for the MT 2002 evaluation. The Chinese sentences are segmented into words. The training data contains a128 a146a67a149a150a144 million Chinese and a128a7a151 a149a152a146 million English words. The block selection algorithm described below runs in less than one hour on a single a127 -Gigahertz Linux machine.</Paragraph> <Paragraph position="1"> Table 3 presents results for various block extension schemes. The first column describes the extension scheme used. The second column reports the total number of blocks collected, in millions, including all the blocks that occurred only once. The third column reports the number of blocks that occurred at least twice. These blocks are used to compute the results in the fourth column: the BLEU score (Papineni et al., 2002) with 4 reference translations using 4-grams, along with a 95% confidence interval, is reported. Lines 1 and 2 of this table show results where only the source interval projection, without any extension, is carried out. For the a130a35a131 a119a131 extension scheme, the high-recall union set itself is used for projection. The results are worse than for all other schemes, since a lot of smaller blocks are discarded due to the projection approach. The a117 a131 a119a131 scheme, where just the $I$ word links are used, is too restrictive, leaving out bigger blocks that are admissible according to $I$. For the Chinese-English test data, there is only a minor difference between the different extension schemes; the best results are obtained for the a117a154a133 a119a133 and the a117a155a133 a119a135 extension schemes. Table 4 shows the effect of the unigram selection threshold, where the a117a56a133 a119a135 blocks are used. The second column shows the number of blocks selected.</Paragraph> <Paragraph position="2"> The best results are obtained for the a121a39a128 and the a121 a146 sets. (Footnote 4: The test data is split into a certain number of subsets; the BLEU score is computed on each subset, and we use the t-test to compare these scores.)</Paragraph> <Paragraph position="3"> The number of blocks can be reduced drastically, while the translation performance declines only gradually.</Paragraph> <Paragraph position="4"> Table 5 shows the effect of the maximum phrase length on the BLEU score for the a156a79a157 block set.
Including blocks with longer phrases actually helps to improve performance, although already a length of a158 obtains nearly identical results.</Paragraph> <Paragraph position="5"> We carried out the following control experiments (using a156a37a159a52a160a77a161a61a162a163a157 as the threshold): we obtained a block set of a164a67a165a152a166a168a167 million blocks by generating blocks from all quadruples of word links in $I$. This set is a proper superset of the blocks learned for the a169a155a170a98a171a170 experiment in Table 3. The resulting BLEU score is a172a168a165a15a167a141a173a111a174 . Including additional smaller blocks even hurts translation performance in this case. Also, for the extension scheme a169a154a175 a171a176 , we carried out the inverse projection as described in Section 2.1 to obtain a block set of a157a67a165a152a173a7a177 million blocks and a BLEU score of a172a168a165a15a167a141a178a7a173 . This number is smaller than the BLEU score of a172a168a165a15a167a141a177a7a164 for the a169 a175 a171a176 restriction: for the translation direction Chinese-to-English, selecting blocks with longer English phrases seems to be important for good translation performance. It is interesting to note that the unigram translation model is symmetric: the translation direction can be switched to English-to-Chinese without re-training the model; only a new Chinese language model is needed. Our experiments, though, show that there is an imbalance with respect to the projection direction that has a significant influence on the translation results. Finally, we carried out an experiment where we used the a169a155a170a98a171a170 block set as a baseline. The extension algorithm was applied only to blocks of target and source length a167 , producing one-to-many translations, e.g. the blocks a160a49a175 and a160 a176 in Figure 6. The BLEU score improved to a172a168a165a15a167a141a178a7a178 with a block set of a164a67a165a15a167a77a172 million blocks. It seems to be important to carry out the block extension also for larger blocks.</Paragraph> <Paragraph position="6"> We also ran the N2 system on the June 2002 DARPA TIDES Large Data evaluation test set. Six research sites and four commercial off-the-shelf systems were evaluated in the Large Data track. A majority of the systems were phrase-based translation systems. For comparison with other sites, we quote the BLEU score. (Table caption fragment: both target and source phrases are shorter than the maximum; the unigram threshold ...)</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Arabic-English Experiments </SectionTitle> <Paragraph position="0"> We also carried out experiments for the translation direction Arabic to English using training data from UN documents. For testing, we use a test set of a167a49a174a7a174 sentences with a173a67a184a93a177a7a157a7a178 words arranged in a167a141a166 documents. The training data contains a167a141a157a7a157a67a165a185a172 million Arabic and a166a7a173a67a165a152a177 million English words. The training data is pre-processed using some morphological analysis. For the Arabic experiments, we have tested the three extension schemes a169a186a170a98a171a170 , a169 a175 a171a175 , and a169 a175 a171a176 , as shown in Table 6. Here, the results for the different schemes differ significantly, and the a169a187a175 a171a176 scheme produces the best results. For the AE experiments, only blocks up to a phrase length of a178 are computed due to disk memory restrictions.
The training data is split into several chunks of a164a96a172a7a172a168a184a188a172a7a172a7a172 training sentence pairs each, and the final block set together with the unigram counts is obtained by merging the block files written onto disk for each of the chunks. The word-to-word alignment is trained using a189 iterations of IBM Model a190 training followed by a189 iterations of HMM Viterbi training. This training procedure takes about a day to execute on a single machine. Additionally, the overall block selection procedure takes about a191a67a192a189 hours to execute.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Previous Work </SectionTitle> <Paragraph position="0"> Block-based translation units are used in several papers on statistical machine translation. (Och et al., 1999) describe the alignment template system for statistical MT: alignment templates correspond to blocks that do have an internal structure. Marcu and Wong (2002) use a joint probability model for blocks where the clumps are contiguous phrases, as in this paper. Yamada and Knight (2002) present a decoder for syntax-based MT that uses so-called phrasal translation units that correspond to blocks.</Paragraph> <Paragraph position="1"> Block unigram counts are used to filter the blocks.</Paragraph> <Paragraph position="2"> The phrasal model is included in a syntax-based model. Projection of phrases has also been used in (Yarowsky et al., 2001). A word link extension algorithm similar to the one presented in this paper is given in (Koehn et al., 2003).</Paragraph> </Section> </Paper>